
[arXiv Papers] Computation and Language 2020-02-18

Contents

1. HotelRec: a Novel Very Large-Scale Hotel Recommendation Dataset [PDF] Abstract
2. GameWikiSum: a Novel Large Multi-Document Summarization Dataset [PDF] Abstract
3. Incorporating BERT into Neural Machine Translation [PDF] Abstract
4. Multi-layer Representation Fusion for Neural Machine Translation [PDF] Abstract
5. Gaussian Smoothen Semantic Features (GSSF) -- Exploring the Linguistic Aspects of Visual Captioning in Indian Languages (Bengali) Using MSCOCO Framework [PDF] Abstract
6. Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language [PDF] Abstract
7. The Utility of General Domain Transfer Learning for Medical Language Tasks [PDF] Abstract
8. SBERT-WK: A Sentence Embedding Method by Dissecting BERT-based Word Models [PDF] Abstract
9. Towards Detection of Subjective Bias using Contextualized Word Embeddings [PDF] Abstract
10. Neural Machine Translation with Joint Representation [PDF] Abstract
11. Exploring Neural Models for Parsing Natural Language into First-Order Logic [PDF] Abstract
12. Learning to Generate Multiple Style Transfer Outputs for an Input Sentence [PDF] Abstract
13. A Multimodal Dialogue System for Conversational Image Editing [PDF] Abstract
14. Deeper Task-Specificity Improves Joint Entity and Relation Extraction [PDF] Abstract
15. Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping [PDF] Abstract
16. Controlling Computation versus Quality for Neural Sequence Models [PDF] Abstract
17. Computing rank-revealing factorizations of matrices stored out-of-core [PDF] Abstract
18. Supervised Phrase-boundary Embeddings [PDF] Abstract
19. Open Knowledge Enrichment for Long-tail Entities [PDF] Abstract
20. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [PDF] Abstract
21. Semantic Relatedness and Taxonomic Word Embeddings [PDF] Abstract

Abstracts

1. HotelRec: a Novel Very Large-Scale Hotel Recommendation Dataset [PDF] Back to contents
  Diego Antognini, Boi Faltings
Abstract: Today, recommender systems are an inevitable part of everyone's daily digital routine and are present on most internet platforms. State-of-the-art deep learning-based models require a large amount of data to achieve their best performance. Many datasets fulfilling this criterion have been proposed for multiple domains, such as Amazon products, restaurants, or beers. However, work and datasets in the hotel domain are limited: the largest hotel review dataset contains fewer than one million samples. Additionally, the hotel domain suffers from higher data sparsity than traditional recommendation datasets, and therefore traditional collaborative-filtering approaches cannot be applied to such data. In this paper, we propose HotelRec, a very large-scale hotel recommendation dataset, based on TripAdvisor, containing 50 million reviews. To the best of our knowledge, HotelRec is the largest publicly available dataset in the hotel domain (50M versus 0.9M) and, additionally, the largest recommendation dataset in a single domain with textual reviews (50M versus 22M). We release HotelRec for further research: this https URL.

2. GameWikiSum: a Novel Large Multi-Document Summarization Dataset [PDF] Back to contents
  Diego Antognini, Boi Faltings
Abstract: Today's research progress in the field of multi-document summarization is obstructed by the small number of available datasets. Since the acquisition of reference summaries is costly, existing datasets contain only hundreds of samples at most, resulting in heavy reliance on hand-crafted features or necessitating additional, manually annotated data. The lack of large corpora therefore hinders the development of sophisticated models. Additionally, most publicly available multi-document summarization corpora are in the news domain, and no analogous dataset exists in the video game domain. In this paper, we propose GameWikiSum, a new domain-specific dataset for multi-document summarization, which is one hundred times larger than commonly used datasets, and in another domain than news. Input documents consist of long professional video game reviews as well as references of their gameplay sections in Wikipedia pages. We analyze the proposed dataset and show that both abstractive and extractive models can be trained on it. We release GameWikiSum for further research: this https URL.

3. Incorporating BERT into Neural Machine Translation [PDF] Back to contents
  Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu
Abstract: The recently proposed BERT has shown great power on a variety of natural language understanding tasks, such as text classification and reading comprehension. However, how to effectively apply BERT to neural machine translation (NMT) has not been sufficiently explored. While BERT is more commonly used for fine-tuning rather than as a contextual embedding in downstream language understanding tasks, our preliminary exploration shows that, in NMT, using BERT as a contextual embedding works better than using it for fine-tuning. This motivates us to think about how to better leverage BERT for NMT along this direction. We propose a new algorithm named the BERT-fused model, in which we first use BERT to extract representations for an input sequence, and then fuse these representations with each layer of the encoder and decoder of the NMT model through attention mechanisms. We conduct experiments on supervised (including sentence-level and document-level translation), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets. Our code is available at \url{this https URL}.
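
To make the fusion step concrete, here is a minimal PyTorch sketch (not the authors' code; module names, dimensions, and the simple averaging of the two attention streams are assumptions) of an encoder layer that attends over precomputed BERT representations in addition to its own self-attention, in the spirit of the BERT-fused model:

```python
# Illustrative sketch (assumptions noted above), not the released implementation.
import torch
import torch.nn as nn

class BertFusedEncoderLayer(nn.Module):
    def __init__(self, d_model=512, d_bert=768, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        # Extra attention whose keys/values come from the (projected) BERT output.
        self.bert_proj = nn.Linear(d_bert, d_model)
        self.bert_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, bert_out):
        # Standard self-attention over the NMT encoder states.
        h_self, _ = self.self_attn(x, x, x)
        # Attention over frozen BERT representations of the same source sentence.
        b = self.bert_proj(bert_out)
        h_bert, _ = self.bert_attn(x, b, b)
        # The paper fuses both streams; a plain average is used here as a placeholder.
        x = self.norm1(x + 0.5 * (h_self + h_bert))
        return self.norm2(x + self.ffn(x))

# Usage: x is (batch, src_len, 512); bert_out is (batch, src_len, 768) from a frozen BERT.
out = BertFusedEncoderLayer()(torch.randn(2, 9, 512), torch.randn(2, 9, 768))
```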

4. Multi-layer Representation Fusion for Neural Machine Translation [PDF] Back to contents
  Qiang Wang, Fuxue Li, Tong Xiao, Yanyang Li, Yinqiao Li, Jingbo Zhu
Abstract: Neural machine translation systems require a number of stacked layers for deep models. But the prediction depends on the sentence representation of the top-most layer with no access to low-level representations. This makes it more difficult to train the model and poses a risk of information loss to prediction. In this paper, we propose a multi-layer representation fusion (MLRF) approach to fusing stacked layers. In particular, we design three fusion functions to learn a better representation from the stack. Experimental results show that our approach yields improvements of 0.92 and 0.56 BLEU points over the strong Transformer baseline on IWSLT German-English and NIST Chinese-English MT tasks respectively. The result is new state-of-the-art in German-English translation.
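
As an illustration of fusing stacked layers, the sketch below implements one plausible fusion function, a learned softmax-weighted sum over all encoder layer outputs; the paper studies three fusion functions, and this particular form is an assumption:

```python
# Minimal sketch of multi-layer representation fusion via a learned weighted sum.
import torch
import torch.nn as nn

class WeightedLayerFusion(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_layers))  # one scalar per layer

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, d_model), one per encoder layer.
        stacked = torch.stack(layer_outputs, dim=0)           # (L, B, T, D)
        weights = torch.softmax(self.scores, dim=0)           # normalize over layers
        return torch.einsum("l,lbtd->btd", weights, stacked)  # fused representation

layers = [torch.randn(2, 7, 512) for _ in range(6)]
fused = WeightedLayerFusion(num_layers=6)(layers)
print(fused.shape)  # torch.Size([2, 7, 512])
```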

5. Gaussian Smoothen Semantic Features (GSSF) -- Exploring the Linguistic Aspects of Visual Captioning in Indian Languages (Bengali) Using MSCOCO Framework [PDF] Back to contents
  Chiranjib Sur
Abstract: In this work, we have introduced Gaussian Smoothen Semantic Features (GSSF) for better semantic selection in Indian regional-language image captioning (Bengali), and introduced a procedure that uses existing translations and English crowd-sourced sentences for training. We have shown that this architecture is a promising alternative where resources are scarce. The main contribution of this work is the development of deep learning architectures for the Bengali language (the fifth most widely spoken language in the world), which has completely different grammar and language attributes. We have shown that these architectures work well for complex applications such as language generation from image contexts, and can diversify the representation through introducing constraints, more extensive features, and unique feature spaces. We also established that we could achieve absolute precision and diversity when we use the smoothened semantic tensor with the traditional LSTM and feature decomposition networks. With a better learning architecture, we succeeded in establishing an automated algorithm and assessment procedure that can help in the evaluation of such applications without requiring expertise and human intervention.
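
The abstract does not spell out the smoothing operation itself; the sketch below assumes it amounts to a 1-D Gaussian filter applied to the semantic feature tensor before it is fed to the LSTM decoder, which may differ from the paper's exact formulation:

```python
# Hedged sketch of "Gaussian smoothening" of semantic features (an assumption,
# not the paper's definition): smooth each feature vector with a 1-D Gaussian kernel.
import numpy as np
from scipy.ndimage import gaussian_filter1d

semantic_features = np.random.rand(49, 2048)   # e.g. image regions x feature dimensions
smoothed = gaussian_filter1d(semantic_features, sigma=1.5, axis=-1)
print(smoothed.shape)  # (49, 2048): same shape, locally smoothed feature values
```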

6. Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language [PDF] Back to contents
  Kohei Matsuura, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara
Abstract: Ainu is an unwritten language that has been spoken by the Ainu people, one of the ethnic groups in Japan. It is recognized by UNESCO as critically endangered, and archiving and documentation of its language heritage is of paramount importance. Although a considerable amount of voice recordings of Ainu folklore has been produced and accumulated to preserve the culture, only a quite limited part of them has been transcribed so far. Thus, we started a project on automatic speech recognition (ASR) for the Ainu language in order to contribute to the development of annotated language archives. In this paper, we report speech corpus development and the structure and performance of end-to-end ASR for Ainu. We investigated four modeling units (phone, syllable, word piece, and word) and found that the syllable-based model performed best in terms of both word and phone recognition accuracy, which were about 60% and over 85% respectively in the speaker-open condition. Furthermore, word and phone accuracies of 80% and 90% have been achieved in a speaker-closed setting. We also found that multilingual ASR training with additional speech corpora of English and Japanese further improves the speaker-open test accuracy.

7. The Utility of General Domain Transfer Learning for Medical Language Tasks [PDF] Back to contents
  Daniel Ranti, Katie Hanss, Shan Zhao, Varun Arvind, Joseph Titano, Anthony Costa, Eric Oermann
Abstract: The purpose of this study is to analyze the efficacy of transfer learning techniques and transformer-based models as applied to medical natural language processing (NLP) tasks, specifically radiological text classification. We used 1,977 labeled head CT reports, from a corpus of 96,303 total reports, to evaluate the efficacy of pretraining using general domain corpora and a combined general and medical domain corpus with a bidirectional representations from transformers (BERT) model for the purpose of radiological text classification. Model performance was benchmarked to a logistic regression using bag-of-words vectorization and a long short-term memory (LSTM) multi-label multi-class classification model, and compared to the published literature in medical text classification. The BERT models using either set of pretrained checkpoints outperformed the logistic regression model, achieving sample-weighted average F1-scores of 0.87 and 0.87 for the general domain model and the combined general and biomedical-domain model. General text transfer learning may be a viable technique to generate state-of-the-art results within medical NLP tasks on radiological corpora, outperforming other deep models such as LSTMs. The efficacy of pretraining and transformer-based models could serve to facilitate the creation of groundbreaking NLP models in the uniquely challenging data environment of medical text.

8. SBERT-WK: A Sentence Embedding Method by Dissecting BERT-based Word Models [PDF] Back to contents
  Bin Wang, C.-C. Jay Kuo
Abstract: Sentence embedding is an important research topic in natural language processing (NLP) since it can transfer knowledge to downstream tasks. Meanwhile, a contextualized word representation model, called BERT, achieves state-of-the-art performance in quite a few NLP tasks. Yet, it is an open problem to generate a high-quality sentence representation from BERT-based word models. It was shown in previous studies that different layers of BERT capture different linguistic properties. This allows us to fuse information across layers to find better sentence representations. In this work, we study the layer-wise pattern of the word representations of deep contextualized models. Then, we propose a new sentence embedding method obtained by dissecting BERT-based word models through geometric analysis of the space spanned by the word representations. It is called the SBERT-WK method. No further training is required in SBERT-WK. We evaluate SBERT-WK on semantic textual similarity and downstream supervised tasks. Furthermore, ten sentence-level probing tasks are presented for detailed linguistic analysis. Experiments show that SBERT-WK achieves state-of-the-art performance. Our codes are publicly available.
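
A hedged sketch of the starting point for such a method: collecting per-layer token representations from a pretrained BERT with the Hugging Face transformers library. The actual SBERT-WK weighting is derived from the geometry of these layer-wise vectors; the plain mean pooling below is only a placeholder.

```python
# Collect layer-wise token representations; mean pooling stands in for the
# paper's geometry-based weighting.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

with torch.no_grad():
    enc = tok("A sentence to embed.", return_tensors="pt")
    hidden_states = model(**enc).hidden_states       # tuple: embeddings + 12 layers
    layers = torch.stack(hidden_states[1:], dim=0)   # (12, 1, seq_len, 768)
    sentence_vec = layers.mean(dim=(0, 2)).squeeze(0)  # placeholder pooling -> (768,)
print(sentence_vec.shape)
```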

9. Towards Detection of Subjective Bias using Contextualized Word Embeddings [PDF] Back to contents
  Tanvi Dadu, Kartikey Pant, Radhika Mamidi
Abstract: Subjective bias detection is critical for applications like propaganda detection, content recommendation, sentiment analysis, and bias neutralization. This bias is introduced in natural language via inflammatory words and phrases, casting doubt over facts, and presupposing the truth. In this work, we perform comprehensive experiments for detecting subjective bias using BERT-based models on the Wiki Neutrality Corpus(WNC). The dataset consists of $360k$ labeled instances, from Wikipedia edits that remove various instances of the bias. We further propose BERT-based ensembles that outperform state-of-the-art methods like $BERT_{large}$ by a margin of $5.6$ F1 score.

10. Neural Machine Translation with Joint Representation [PDF] Back to contents
  YanYang Li, Qiang Wang, Tong Xiao, Tongran Liu, Jingbo Zhu
Abstract: Though early successes of Statistical Machine Translation (SMT) systems are attributed in part to the explicit modelling of the interaction between any two source and target units, e.g., alignment, the recent Neural Machine Translation (NMT) systems resort to attention, which only partially encodes the interaction, for efficiency. In this paper, we employ a Joint Representation that fully accounts for each possible interaction. We sidestep the inefficiency issue by refining representations with the proposed efficient attention operation. The resulting Reformer models offer a new Sequence-to-Sequence modelling paradigm besides the Encoder-Decoder framework and outperform the Transformer baseline in either the small-scale IWSLT14 German-English, English-German and IWSLT15 Vietnamese-English or the large-scale NIST12 Chinese-English translation tasks by about 1 BLEU point. We also propose a systematic model scaling approach, allowing the Reformer model to beat the state-of-the-art Transformer in IWSLT14 German-English and NIST12 Chinese-English with about 50% fewer parameters. The code is publicly available at this https URL.

11. Exploring Neural Models for Parsing Natural Language into First-Order Logic [PDF] Back to contents
  Hrituraj Singh, Milan Aggrawal, Balaji Krishnamurthy
Abstract: Semantic parsing is the task of obtaining machine-interpretable representations from natural language text. We consider one such formal representation - First-Order Logic (FOL) and explore the capability of neural models in parsing English sentences to FOL. We model FOL parsing as a sequence to sequence mapping task where given a natural language sentence, it is encoded into an intermediate representation using an LSTM followed by a decoder which sequentially generates the predicates in the corresponding FOL formula. We improve the standard encoder-decoder model by introducing a variable alignment mechanism that enables it to align variables across predicates in the predicted FOL. We further show the effectiveness of predicting the category of FOL entity - Unary, Binary, Variables and Scoped Entities, at each decoder step as an auxiliary task on improving the consistency of generated FOL. We perform rigorous evaluations and extensive ablations. We also aim to release our code as well as large scale FOL dataset along with models to aid further research in logic-based parsing and inference in NLP.

12. Learning to Generate Multiple Style Transfer Outputs for an Input Sentence [PDF] Back to contents
  Kevin Lin, Ming-Yu Liu, Ming-Ting Sun, Jan Kautz
Abstract: Text style transfer refers to the task of rephrasing a given text in a different style. While various methods have been proposed to advance the state of the art, they often assume the transfer output follows a delta distribution, and thus their models cannot generate different style transfer results for a given input text. To address the limitation, we propose a one-to-many text style transfer framework. In contrast to prior works that learn a one-to-one mapping that converts an input sentence to one output sentence, our approach learns a one-to-many mapping that can convert an input sentence to multiple different output sentences, while preserving the input content. This is achieved by applying adversarial training with a latent decomposition scheme. Specifically, we decompose the latent representation of the input sentence to a style code that captures the language style variation and a content code that encodes the language style-independent content. We then combine the content code with the style code for generating a style transfer output. By combining the same content code with a different style code, we generate a different style transfer output. Extensive experimental results with comparisons to several text style transfer approaches on multiple public datasets using a diverse set of performance metrics validate effectiveness of the proposed approach.
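
A schematic sketch of the latent decomposition described above, with all module names and dimensions assumed: the encoder output is split into a content code and a style code, and re-pairing one content code with different style codes yields multiple transfer outputs.

```python
# Illustrative decomposition/recombination only; the adversarial training and the
# sequence decoder from the paper are omitted.
import torch
import torch.nn as nn

class DecomposingEncoder(nn.Module):
    def __init__(self, d_in=512, d_content=384, d_style=128):
        super().__init__()
        self.to_content = nn.Linear(d_in, d_content)  # style-independent content code
        self.to_style = nn.Linear(d_in, d_style)      # style code

    def forward(self, sent_repr):
        return self.to_content(sent_repr), self.to_style(sent_repr)

class CombiningDecoder(nn.Module):
    def __init__(self, d_content=384, d_style=128, d_out=512):
        super().__init__()
        self.proj = nn.Linear(d_content + d_style, d_out)

    def forward(self, content, style):
        return self.proj(torch.cat([content, style], dim=-1))

enc, dec = DecomposingEncoder(), CombiningDecoder()
content, _ = enc(torch.randn(1, 512))
# One content code paired with several sampled style codes -> several distinct outputs.
outputs = [dec(content, torch.randn(1, 128)) for _ in range(3)]
```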

13. A Multimodal Dialogue System for Conversational Image Editing [PDF] Back to contents
  Tzu-Hsiang Lin, Trung Bui, Doo Soon Kim, Jean Oh
Abstract: In this paper, we present a multimodal dialogue system for Conversational Image Editing. We formulate our multimodal dialogue system as a Partially Observable Markov Decision Process (POMDP) and trained it with a Deep Q-Network (DQN) and a user simulator. Our evaluation shows that the DQN policy outperforms a rule-based baseline policy, achieving a 90% success rate under high error rates. We also conducted a real user study and analyzed real user behavior.

14. Deeper Task-Specificity Improves Joint Entity and Relation Extraction [PDF] Back to contents
  Phil Crone
Abstract: Multi-task learning (MTL) is an effective method for learning related tasks, but designing MTL models necessitates deciding which and how many parameters should be task-specific, as opposed to shared between tasks. We investigate this issue for the problem of jointly learning named entity recognition (NER) and relation extraction (RE) and propose a novel neural architecture that allows for deeper task-specificity than does prior work. In particular, we introduce additional task-specific bidirectional RNN layers for both the NER and RE tasks and tune the number of shared and task-specific layers separately for different datasets. We achieve state-of-the-art (SOTA) results for both tasks on the ADE dataset; on the CoNLL04 dataset, we achieve SOTA results on the NER task and competitive results on the RE task while using an order of magnitude fewer trainable parameters than the current SOTA architecture. An ablation study confirms the importance of the additional task-specific layers for achieving these results. Our work suggests that previous solutions to joint NER and RE undervalue task-specificity and demonstrates the importance of correctly balancing the number of shared and task-specific parameters for MTL approaches in general.
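
The layering pattern described above can be sketched as follows; this is a simplified stand-in rather than the paper's architecture (layer counts, sizes, and the per-token relation head are assumptions), with a shared BiLSTM feeding separate task-specific BiLSTM stacks for NER and RE:

```python
# Shared + task-specific bidirectional RNN layers for joint NER and RE (sketch).
import torch
import torch.nn as nn

class JointNerReModel(nn.Module):
    def __init__(self, d_in=300, d_hid=128, n_tags=9, n_rels=5,
                 shared_layers=1, ner_layers=2, re_layers=2):
        super().__init__()
        self.shared = nn.LSTM(d_in, d_hid, shared_layers, bidirectional=True, batch_first=True)
        self.ner_rnn = nn.LSTM(2 * d_hid, d_hid, ner_layers, bidirectional=True, batch_first=True)
        self.re_rnn = nn.LSTM(2 * d_hid, d_hid, re_layers, bidirectional=True, batch_first=True)
        self.ner_head = nn.Linear(2 * d_hid, n_tags)  # per-token entity tags
        self.re_head = nn.Linear(2 * d_hid, n_rels)   # simplified per-token relation scores

    def forward(self, x):
        h, _ = self.shared(x)        # shared layers used by both tasks
        ner_h, _ = self.ner_rnn(h)   # NER-specific layers
        re_h, _ = self.re_rnn(h)     # RE-specific layers
        return self.ner_head(ner_h), self.re_head(re_h)

ner_logits, re_logits = JointNerReModel()(torch.randn(2, 10, 300))
```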

15. Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping [PDF] Back to contents
  Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, Noah Smith
Abstract: Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing. This process, however, is often brittle: even with the same hyperparameter values, distinct random seeds can lead to substantially different results. To better understand this phenomenon, we experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds. We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials. Further, we examine two factors influenced by the choice of random seed: weight initialization and training data order. We find that both contribute comparably to the variance of out-of-sample performance, and that some weight initializations perform well across all tasks explored. On small datasets, we observe that many fine-tuning trials diverge part of the way through training, and we offer best practices for practitioners to stop training less promising runs early. We publicly release all of our experimental data, including training and validation scores for 2,100 trials, to encourage further analysis of training dynamics during fine-tuning.
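
A minimal sketch of separating the two sources of randomness studied here, assuming a PyTorch-style setup: one seed controls weight initialization of the task head, another controls the order in which training batches are drawn.

```python
# Vary weight-initialization seed and data-order seed independently (illustrative only).
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 16), torch.randint(0, 2, (100,)))

def make_head(init_seed):
    torch.manual_seed(init_seed)                  # weight-initialization seed
    return torch.nn.Linear(16, 2)

def make_loader(data_seed):
    g = torch.Generator().manual_seed(data_seed)  # data-order seed
    return DataLoader(dataset, batch_size=8, shuffle=True, generator=g)

for init_seed in range(3):
    for data_seed in range(3):
        head, loader = make_head(init_seed), make_loader(data_seed)
        # ... fine-tune and record the validation score for this (init_seed, data_seed) pair
```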

16. Controlling Computation versus Quality for Neural Sequence Models [PDF] Back to contents
  Ankur Bapna, Naveen Arivazhagan, Orhan Firat
Abstract: Most neural networks utilize the same amount of compute for every example independent of the inherent complexity of the input. Further, methods that adapt the amount of computation to the example focus on finding a fixed inference-time computational graph per example, ignoring any external computational budgets or varying inference time limitations. In this work, we utilize conditional computation to make neural sequence models (Transformer) more efficient and computation-aware during inference. We first modify the Transformer architecture, making each set of operations conditionally executable depending on the output of a learned control network. We then train this model in a multi-task setting, where each task corresponds to a particular computation budget. This allows us to train a single model that can be controlled to operate on different points of the computation-quality trade-off curve, depending on the available computation budget at inference time. We evaluate our approach on two tasks: (i) WMT English-French Translation and (ii) Unsupervised representation learning (BERT). Our experiments demonstrate that the proposed Conditional Computation Transformer (CCT) is competitive with vanilla Transformers when allowed to utilize its full computational budget, while improving significantly over computationally equivalent baselines when operating on smaller computational budgets.
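
A toy sketch of the conditional-computation idea, under the assumption that a small control network gates an expensive sub-block per example; the budget-aware multi-task training objective from the paper is omitted.

```python
# A control network decides how much of an expensive block to execute (sketch).
import torch
import torch.nn as nn

class ConditionalBlock(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.control = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
        self.block = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                   nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        gate = self.control(x.mean(dim=1, keepdim=True))  # (batch, 1, 1), value in [0, 1]
        # During training the gate scales the block's contribution; at inference it can
        # be thresholded so the block is skipped entirely under a tight budget.
        return x + gate * self.block(x)

out = ConditionalBlock()(torch.randn(4, 12, 256))
```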

17. Computing rank-revealing factorizations of matrices stored out-of-core [PDF] Back to contents
  Nathan Heavner, Per-Gunnar Martinsson, Gregorio Quintana-Ortí
Abstract: This paper describes efficient algorithms for computing rank-revealing factorizations of matrices that are too large to fit in RAM, and must instead be stored on slow external memory devices such as solid-state or spinning disk hard drives (out-of-core or out-of-memory). Traditional algorithms for computing rank revealing factorizations, such as the column pivoted QR factorization, or techniques for computing a full singular value decomposition of a matrix, are very communication intensive. They are naturally expressed as a sequence of matrix-vector operations, which become prohibitively expensive when data is not available in main memory. Randomization allows these methods to be reformulated so that large contiguous blocks of the matrix can be processed in bulk. The paper describes two distinct methods. The first is a blocked version of column pivoted Householder QR, organized as a ``left-looking'' method to minimize the number of write operations (which are more expensive than read operations on a spinning disk drive). The second method results in a so called UTV factorization which expresses a matrix $A$ as $A = U T V^*$ where $U$ and $V$ are unitary, and $T$ is triangular. This method is organized as an algorithm-by-blocks, in which floating point operations overlap read and write operations. The second method incorporates power iterations, and is exceptionally good at revealing the numerical rank; it can often be used as a substitute for a full singular value decomposition. Numerical experiments demonstrate that the new algorithms are almost as fast when processing data stored on a hard drive as traditional algorithms are for data stored in main memory. To be precise, the computational time for fully factorizing an $n\times n$ matrix scales as $cn^{3}$, with a scaling constant $c$ that is only marginally larger when the matrix is stored out of core.
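
An in-core NumPy sketch of the randomized building block: a sampled range basis with power iterations, from which an approximate rank-revealing factorization can be formed. The out-of-core algorithms in the paper additionally stream the matrix from disk in large blocks, which is not shown here.

```python
# Randomized range finder with power iterations; numerical-rank estimate (sketch).
import numpy as np

def randomized_range(A, k, n_power_iter=2, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Y = A @ rng.standard_normal((n, k))   # one pass over A; done blockwise out-of-core
    for _ in range(n_power_iter):         # power iterations sharpen the spectrum
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                # orthonormal basis for the sampled range
    return Q

A = np.linalg.qr(np.random.randn(500, 40))[0] @ np.random.randn(40, 300)  # rank ~40
Q = randomized_range(A, k=50)
B = Q.T @ A                               # small matrix; SVD of B gives A ~= (Q U) S Vt
s = np.linalg.svd(B, compute_uv=False)
print((s > 1e-8 * s[0]).sum())            # numerical rank estimate, ~40
```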

18. Supervised Phrase-boundary Embeddings [PDF] Back to contents
  Manni Singh, David Weston, Mark Levene
Abstract: We propose a new word embedding model, called SPhrase, that incorporates supervised phrase information. Our method modifies traditional word embeddings by ensuring that all target words in a phrase have exactly the same context. We demonstrate that including this information within a context window produces superior embeddings for both intrinsic evaluation tasks and downstream extrinsic tasks.
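
A small sketch of the stated idea, assuming phrase annotations are given as index spans: every target word inside a phrase is paired with the same context, drawn from around the phrase boundary rather than around each individual word.

```python
# Generate (target, context) training pairs where all words in a phrase share a context.
def phrase_training_pairs(tokens, phrases, window=2):
    """tokens: list of words; phrases: list of (start, end) index spans."""
    pairs = []
    for start, end in phrases:
        # Context is drawn around the phrase boundary, not around each word.
        context = tokens[max(0, start - window):start] + tokens[end:end + window]
        for target in tokens[start:end]:
            pairs.extend((target, c) for c in context)
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(phrase_training_pairs(sentence, phrases=[(1, 4)]))  # phrase: "quick brown fox"
```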

19. Open Knowledge Enrichment for Long-tail Entities [PDF] Back to contents
  Ermei Cao, Difeng Wang, Jiacheng Huang, Wei Hu
Abstract: Knowledge bases (KBs) have gradually become a valuable asset for many AI applications. While many current KBs are quite large, they are widely acknowledged as incomplete, especially lacking facts about long-tail entities, e.g., less famous persons. Existing approaches enrich KBs mainly by completing missing links or filling in missing values. However, they only tackle a part of the enrichment problem and lack specific considerations regarding long-tail entities. In this paper, we propose a full-fledged approach to knowledge enrichment, which predicts missing properties and infers true facts of long-tail entities from the open Web. Prior knowledge from popular entities is leveraged to improve every enrichment step. Our experiments on synthetic and real-world datasets and comparison with related work demonstrate the feasibility and superiority of the approach.

20. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [PDF] Back to contents
  Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Xilin Chen, Ming Zhou
Abstract: We propose UniViLM: a Unified Video and Language pre-training Model for multimodal understanding and generation. Motivated by the recent success of BERT-based pre-training techniques for NLP and image-language tasks, VideoBERT and CBT were proposed to exploit the BERT model for video and language pre-training using narrated instructional videos. Different from these works, which only pre-train for the understanding task, we propose a unified video-language pre-training model for both understanding and generation tasks. Our model comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone. We first pre-train our model to learn a universal representation for both video and language on a large instructional video dataset. Then we fine-tune the model on two multimodal tasks: an understanding task (text-based video retrieval) and a generation task (multimodal video captioning). Our extensive experiments show that our method can improve the performance of both understanding and generation tasks and achieves state-of-the-art results.

21. Semantic Relatedness and Taxonomic Word Embeddings [PDF] Back to contents
  Magdalena Kacmajor, John D. Kelleher, Filip Klubicka, Alfredo Maldonado
Abstract: This paper connects a series of papers dealing with taxonomic word embeddings. It begins by noting that there are different types of semantic relatedness and that different lexical representations encode different forms of relatedness. A particularly important distinction within semantic relatedness is that of thematic versus taxonomic relatedness. Next, we present a number of experiments that analyse taxonomic embeddings trained on a synthetic corpus generated via a random walk over a taxonomy. These experiments demonstrate how the properties of the synthetic corpus, such as the percentage of rare words, are affected by the shape of the knowledge graph the corpus is generated from. Finally, we explore the effect of the relative sizes of natural and synthetic corpora on the performance of embeddings when taxonomic and thematic embeddings are combined.
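
A minimal sketch of generating such a synthetic corpus by random walks over a toy taxonomy; the graph, walk length, and sentence count are illustrative, whereas the experiments above use a full taxonomy.

```python
# Generate a synthetic "corpus" of pseudo-sentences via random walks over a taxonomy.
import random

taxonomy = {  # toy undirected hypernym/hyponym edges
    "animal": ["dog", "cat", "bird"],
    "dog": ["animal", "poodle", "beagle"],
    "cat": ["animal", "siamese"],
    "bird": ["animal", "sparrow"],
    "poodle": ["dog"], "beagle": ["dog"], "siamese": ["cat"], "sparrow": ["bird"],
}

def random_walk_corpus(graph, n_sentences=5, walk_len=6, seed=0):
    rng = random.Random(seed)
    corpus = []
    for _ in range(n_sentences):
        node = rng.choice(list(graph))
        walk = [node]
        for _ in range(walk_len - 1):
            node = rng.choice(graph[node])  # step to a random neighbour
            walk.append(node)
        corpus.append(" ".join(walk))
    return corpus

for sentence in random_walk_corpus(taxonomy):
    print(sentence)
```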
