Abstracts
1. A Common Semantic Space for Monolingual and Cross-Lingual Meta-Embeddings [PDF]
Iker García, Rodrigo Agerri, German Rigau
Abstract: This paper presents a new technique for creating monolingual and cross-lingual meta-embeddings. Our method integrates multiple word embeddings created from complementary techniques, textual sources, knowledge bases and languages. Existing word vectors are projected to a common semantic space using linear transformations and averaging. With our method the resulting meta-embeddings maintain the dimensionality of the original embeddings without losing information while dealing with the out-of-vocabulary problem. An extensive empirical evaluation demonstrates the effectiveness of our technique with respect to previous work on various intrinsic and extrinsic multilingual evaluations, obtaining competitive results for Semantic Textual Similarity and state-of-the-art performance for word similarity and POS tagging (English and Spanish). The resulting cross-lingual meta-embeddings also exhibit excellent cross-lingual transfer learning capabilities. In other words, we can leverage pre-trained source embeddings from a resource-rich language in order to improve the word representations for under-resourced languages.
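A minimal sketch of the core idea, not the authors' implementation: map one toy embedding space onto another with an orthogonal linear transformation learned on the shared vocabulary (a Procrustes solution is assumed here), then average the projected vectors per word so that words missing from one space still receive a meta-embedding.

```python
# Illustrative sketch only; assumes two embedding spaces of equal dimensionality
# and a Procrustes-style orthogonal map as the linear transformation.
import numpy as np

def procrustes_map(src, tgt):
    """Orthogonal W minimizing ||src @ W - tgt||_F, learned on shared words."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

def meta_embed(emb_a, emb_b):
    """emb_a, emb_b: dict word -> 1-D np.ndarray of the same dimensionality."""
    shared = sorted(set(emb_a) & set(emb_b))
    a = np.stack([emb_a[w] for w in shared])
    b = np.stack([emb_b[w] for w in shared])
    w_ab = procrustes_map(a, b)                 # project space A onto space B
    meta = {}
    for word in set(emb_a) | set(emb_b):        # OOV words keep the vector from the space that has them
        vecs = []
        if word in emb_a:
            vecs.append(emb_a[word] @ w_ab)
        if word in emb_b:
            vecs.append(emb_b[word])
        meta[word] = np.mean(vecs, axis=0)      # average in the common space
    return meta
```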
2. Modality-Balanced Models for Visual Dialogue [PDF]
Hyounghun Kim, Hao Tan, Mohit Bansal
Abstract: The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue. However, via manual analysis, we find that a large number of conversational questions can be answered by only looking at the image without any access to the context history, while others still need the conversation context to predict the correct answers. We demonstrate that due to this reason, previous joint-modality (history and image) models over-rely on and are more prone to memorizing the dialogue history (e.g., by extracting certain keywords or patterns in the context information), whereas image-only models are more generalizable (because they cannot memorize or extract keywords from history) and perform substantially better at the primary normalized discounted cumulative gain (NDCG) task metric which allows multiple correct answers. Hence, this observation encourages us to explicitly maintain two models, i.e., an image-only model and an image-history joint model, and combine their complementary abilities for a more balanced multimodal model. We present multiple methods for this integration of the two models, via ensemble and consensus dropout fusion with shared parameters. Empirically, our models achieve strong results on the Visual Dialog challenge 2019 (rank 3 on NDCG and high balance across metrics), and substantially outperform the winner of the Visual Dialog challenge 2018 on most metrics.
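A hedged sketch of the ensemble side of this idea only (consensus dropout fusion is not shown): average the answer-candidate distributions of an image-only model and an image-history joint model so that neither modality dominates.

```python
# Score-level ensemble of two hypothetical models scoring the same answer candidates.
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def ensemble(image_only_logits, joint_logits, alpha=0.5):
    """alpha balances the image-only and image-history joint models."""
    return alpha * softmax(image_only_logits) + (1.0 - alpha) * softmax(joint_logits)

# Example: candidate ranking for one dialogue turn
probs = ensemble(np.array([2.0, 0.1, -1.0]), np.array([0.3, 1.5, -0.2]))
ranking = np.argsort(-probs)  # candidates ordered for NDCG-style evaluation
```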
3. A Hybrid Solution to Learn Turn-Taking in Multi-Party Service-based Chat Groups [PDF]
Maira Gatti de Bayser, Melina Alberio Guerra, Paulo Cavalin, Claudio Pinhanez
Abstract: To predict the next most likely participant to interact in a multi-party conversation is a difficult problem. In a text-based chat group, the only information available is the sender, the content of the text and the dialogue history. In this paper we present our study on how this information can be used for the prediction task through a corpus and architecture that integrates turn-taking classifiers based on Maximum Likelihood Expectation (MLE), Convolutional Neural Networks (CNN) and Finite State Automata (FSA). The corpus is a synthetic adaptation of the Multi-Domain Wizard-of-Oz dataset (MultiWOZ) to a multiple travel service-based bots scenario with dialogue errors and was created to simulate user's interaction and evaluate the architecture. We present experimental results which show that the CNN approach achieves better performance than the baseline with an accuracy of 92.34%, but the integrated solution with MLE, CNN and FSA achieves even better performance, with 95.65%.
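A hypothetical illustration of the MLE component only: a next-speaker predictor that counts who tends to take the turn after whom in the training dialogues (the CNN and FSA components are not sketched here).

```python
from collections import Counter, defaultdict

class MLETurnTaking:
    def __init__(self):
        self.transitions = defaultdict(Counter)

    def fit(self, dialogues):
        """dialogues: iterable of speaker-id sequences, e.g. ['user', 'hotel_bot', 'user']."""
        for turns in dialogues:
            for prev, nxt in zip(turns, turns[1:]):
                self.transitions[prev][nxt] += 1

    def predict(self, last_speaker):
        counts = self.transitions.get(last_speaker)
        return counts.most_common(1)[0][0] if counts else None
```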
4. RobBERT: a Dutch RoBERTa-based Language Model [PDF]
Pieter Delobelle, Thomas Winters, Bettina Berendt
Abstract: Pre-trained language models have been dominating the field of natural language processing in recent years, and have led to significant performance gains for various complex natural language tasks. One of the most prominent pre-trained language models is BERT (Bi-directional Encoders for Transformers), which was released as an English as well as a multilingual version. Although multilingual BERT performs well on many tasks, recent studies showed that BERT models trained on a single language significantly outperform the multilingual results. Training a Dutch BERT model thus has a lot of potential for a wide range of Dutch NLP tasks. While previous approaches have used earlier implementations of BERT to train their Dutch BERT, we used RoBERTa, a robustly optimized BERT approach, to train a Dutch language model called RobBERT. We show that RobBERT improves state of the art results in Dutch-specific language tasks, and also outperforms other existing Dutch BERT-based models in sentiment analysis. These results indicate that RobBERT is a powerful pre-trained model for fine-tuning for a large variety of Dutch language tasks. We publicly release this pre-trained model in hope of supporting further downstream Dutch NLP applications.
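A minimal usage sketch for fine-tuning such a Dutch model for sentiment classification with Hugging Face transformers; the checkpoint identifier below is an assumption, substitute the one released by the authors.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "pdelobelle/robbert-v2-dutch-base"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Wat een geweldige film!", return_tensors="pt")
logits = model(**inputs).logits  # fine-tune on labelled Dutch data before relying on this
```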
5. Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System [PDF]
Yun-Wei Chu, Kuan-Yen Lin, Chao-Chun Hsu, Lun-Wei Ku
Abstract: Understanding dynamic scenes and dialogue contexts in order to converse with users has been challenging for multimodal dialogue systems. The 8th Dialog System Technology Challenge (DSTC8) proposed an Audio Visual Scene-Aware Dialog (AVSD) task, which contains multiple modalities including audio, vision, and language, to evaluate how dialogue systems understand different modalities and respond to users. In this paper, we propose a multi-step joint-modality attention network (JMAN) based on a recurrent neural network (RNN) to reason over videos. Our model performs a multi-step attention mechanism and jointly considers both visual and textual representations in each reasoning process to better integrate information from the two different modalities. Compared to the baseline released by the AVSD organizers, our model achieves a relative 12.1% and 22.4% improvement over the baseline on ROUGE-L score and CIDEr score.
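A loose sketch of a single joint-modality reasoning step (the actual JMAN architecture is not reproduced here): the question state attends over visual and textual memories separately, and the two attended vectors are fused before the next step.

```python
import torch
import torch.nn as nn

class JointModalityStep(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, query, visual_feats, text_feats):
        # query: (B, 1, dim); visual_feats, text_feats: (B, T, dim)
        v, _ = self.vis_attn(query, visual_feats, visual_feats)
        t, _ = self.txt_attn(query, text_feats, text_feats)
        return self.fuse(torch.cat([v, t], dim=-1))  # refined query for the next step
```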
6. Plato Dialogue System: A Flexible Conversational AI Research Platform [PDF]
Alexandros Papangelis, Mahdi Namazifar, Chandra Khatri, Yi-Chia Wang, Piero Molino, Gokhan Tur
Abstract: As the field of Spoken Dialogue Systems and Conversational AI grows, so does the need for tools and environments that abstract away implementation details in order to expedite the development process, lower the barrier of entry to the field, and offer a common test-bed for new ideas. In this paper, we present Plato, a flexible Conversational AI platform written in Python that supports any kind of conversational agent architecture, from standard architectures to architectures with jointly-trained components, single- or multi-party interactions, and offline or online training of any conversational agent component. Plato has been designed to be easy to understand and debug and is agnostic to the underlying learning frameworks that train each component.
7. Supervised Speaker Embedding De-Mixing in Two-Speaker Environment [PDF]
Yanpei Shi, Thomas Hain
Abstract: In this work, a speaker embedding de-mixing approach is proposed. Instead of separating the two-speaker signal in signal space as in speech source separation, the proposed approach separates different speaker properties from the two-speaker signal in embedding space. The proposed approach contains two steps. In step one, clean speaker embeddings are learned and collected by a residual TDNN based network. In step two, the two-speaker signal and the embedding of one of the speakers are input to a speaker embedding de-mixing network. The de-mixing network is trained to generate the embedding of the other speaker by reconstruction loss. Speaker identification accuracy on the de-mixed speaker embeddings is used to evaluate the quality of the obtained embeddings. Experiments are done on two kinds of data: artificially augmented two-speaker data (TIMIT) and real-world recordings of two-speaker data (MC-WSJ). Six different speaker embedding de-mixing architectures are investigated. Compared with the speaker identification accuracy on the clean speaker embeddings (98.5%), the obtained results show that one of the speaker embedding de-mixing architectures obtains close performance, reaching 96.9% test accuracy on TIMIT when the SNR between the target speaker and the interfering speaker is 5 dB. More surprisingly, we found that choosing a simple subtraction as the embedding de-mixing function could obtain the second best performance, reaching 95.2% test accuracy.
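A sketch of the simplest de-mixing function mentioned above (subtraction in embedding space); the trained de-mixing network itself is not shown, and the identification step here is a plain cosine nearest-neighbour lookup over assumed enrolled embeddings.

```python
import numpy as np

def demix_by_subtraction(mixture_embedding, known_speaker_embedding):
    """Estimate the other speaker's embedding from a two-speaker mixture embedding."""
    return mixture_embedding - known_speaker_embedding

def identify(estimate, enrolled):
    """enrolled: dict speaker_id -> clean embedding; return the closest speaker."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(enrolled, key=lambda spk: cosine(estimate, enrolled[spk]))
```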
8. On-Device Information Extraction from Screenshots in form of tags [PDF]
Sumit Kumar, Gopi Ramena, Manoj Goyal, Debi Mohanty, Ankur Agarwal, Benu Changmai, Sukumar Moharana
Abstract: We propose a method to make mobile screenshots easily searchable. In this paper, we present the workflow in which we: 1) preprocessed a collection of screenshots, 2) identified the script present in the image, 3) extracted unstructured text from images, 4) identified the language of the extracted text, 5) extracted keywords from the text, 6) identified tags based on image features, 7) expanded the tag set by identifying related keywords, 8) inserted image tags with relevant images after ranking and indexed them to make it searchable on device. We built the pipeline, which supports multiple languages, and executed it on-device, which addressed privacy concerns. We developed novel architectures for components in the pipeline, optimized performance and memory for on-device computation. We observed from experimentation that the solution developed can reduce overall user effort and improve end user experience while searching, whose results are published.
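A self-contained toy of the keyword-extraction and tag-ranking steps (5 and 8 above); the on-device models used for the other steps are not reproduced here, and the specific ranking heuristic is an assumption.

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "for", "on"}

def extract_keywords(text, top_k=5):
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_k)]

def rank_tags(image_tags, keywords):
    # image-feature tags first (assumed more reliable), then keywords by frequency rank
    seen, ranked = set(), []
    for tag in list(image_tags) + list(keywords):
        if tag not in seen:
            seen.add(tag)
            ranked.append(tag)
    return ranked

text = "Flight to Paris booked for Monday, Paris hotel booking pending"
print(rank_tags(["screenshot", "travel"], extract_keywords(text)))
```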
9. User-in-the-loop Adaptive Intent Detection for Instructable Digital Assistant [PDF]
Nicolas Lair, Clément Delgrange, David Mugisha, Jean-Michel Dussoux, Pierre-Yves Oudeyer, Peter Ford Dominey
Abstract: People are becoming increasingly comfortable using Digital Assistants (DAs) to interact with services or connected objects. However, for non-programming users, the available possibilities for customizing their DA are limited and do not include the possibility of teaching the assistant new tasks. To make the most of the potential of DAs, users should be able to customize assistants by instructing them through Natural Language (NL). To provide such functionalities, NL interpretation in traditional assistants should be improved: (1) The intent identification system should be able to recognize new forms of known intents, and to acquire new intents as they are expressed by the user. (2) In order to be adaptive to novel intents, the Natural Language Understanding module should be sample efficient, and should not rely on a pretrained model. Rather, the system should continuously collect the training data as it learns new intents from the user. In this work, we propose AidMe (Adaptive Intent Detection in Multi-Domain Environments), a user-in-the-loop adaptive intent detection framework that allows the assistant to adapt to its user by learning his intents as their interaction progresses. AidMe builds its repertoire of intents and collects data to train a model of semantic similarity evaluation that can discriminate between the learned intents and autonomously discover new forms of known intents. AidMe addresses two major issues - intent learning and user adaptation - for instructable digital assistants. We demonstrate the capabilities of AidMe as a standalone system by comparing it with a one-shot learning system and a pretrained NLU module through simulations of interactions with a user. We also show how AidMe can smoothly integrate to an existing instructable digital assistant.
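A hypothetical sketch of the adaptive part of such a system: match utterances to learned intents by embedding similarity and register a new intent when the user teaches one. `embed` is a placeholder for any sentence encoder, and the threshold is an assumed hyperparameter.

```python
import numpy as np

class AdaptiveIntentDetector:
    def __init__(self, embed, threshold=0.8):
        self.embed = embed          # callable: str -> 1-D np.ndarray
        self.threshold = threshold
        self.examples = []          # list of (intent_name, embedding)

    def detect(self, utterance):
        vec = self.embed(utterance)
        best, score = None, -1.0
        for intent, ref in self.examples:
            sim = float(vec @ ref / (np.linalg.norm(vec) * np.linalg.norm(ref) + 1e-8))
            if sim > score:
                best, score = intent, sim
        return best if score >= self.threshold else None

    def teach(self, utterance, intent):
        """Called when the user labels an utterance the assistant did not recognize."""
        self.examples.append((intent, self.embed(utterance)))
```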