
[arXiv papers] Computation and Language 2020-08-12

Contents

1. The Sockeye 2 Neural Machine Translation Toolkit at AMTA 2020 [PDF] Abstract
2. Revisiting Low Resource Status of Indian Languages in Machine Translation [PDF] Abstract
3. LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for Multi-Granular Propaganda Span Identification [PDF] Abstract
4. Hybrid Ranking Network for Text-to-SQL [PDF] Abstract
5. A Neural Generative Model for Joint Learning Topics and Topic-Specific Word Embeddings [PDF] Abstract
6. A Comparison of Synthetic Oversampling Methods for Multi-class Text Classification [PDF] Abstract
7. A parallel evaluation data set of software documentation with document structure annotation [PDF] Abstract
8. Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling [PDF] Abstract
9. Can We Spot the "Fake News" Before It Was Even Written? [PDF] Abstract
10. Real-Time Sign Language Detection using Human Pose Estimation [PDF] Abstract
11. Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN [PDF] Abstract
12. DensE: An Enhanced Non-Abelian Group Representation for Knowledge Graph Embedding [PDF] Abstract
13. Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings [PDF] Abstract
14. Context Reinforced Neural Topic Modeling over Short Texts [PDF] Abstract
15. Neural PLDA Modeling for End-to-End Speaker Verification [PDF] Abstract
16. On Learning Language-Invariant Representations for Universal Machine Translation [PDF] Abstract
17. Transformer with Bidirectional Decoder for Speech Recognition [PDF] Abstract

Abstracts

1. The Sockeye 2 Neural Machine Translation Toolkit at AMTA 2020 [PDF] Back to Contents
  Tobias Domhan, Michael Denkowski, David Vilar, Xing Niu, Felix Hieber, Kenneth Heafield
Abstract: We present Sockeye 2, a modernized and streamlined version of the Sockeye neural machine translation (NMT) toolkit. New features include a simplified code base through the use of MXNet's Gluon API, a focus on state of the art model architectures, distributed mixed precision training, and efficient CPU decoding with 8-bit quantization. These improvements result in faster training and inference, higher automatic metric scores, and a shorter path from research to production.
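As a pointer to what the 8-bit CPU decoding above involves, here is a minimal NumPy sketch of symmetric int8 weight quantization; it illustrates the general technique only, not Sockeye 2's actual implementation.

```python
# Illustrative symmetric per-tensor int8 quantization (not Sockeye 2's code).
import numpy as np

def quantize_int8(w):
    """Approximate w as scale * q with q stored in int8."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)   # stand-in weight matrix
q, s = quantize_int8(w)
print("max abs rounding error:", np.abs(w - dequantize(q, s)).max())
```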

2. Revisiting Low Resource Status of Indian Languages in Machine Translation [PDF] Back to Contents
  Jerin Philip, Shashank Siripragada, Vinay P. Namboodiri, C.V. Jawahar
Abstract: Indian language machine translation performance is hampered by the lack of large-scale multi-lingual sentence-aligned corpora and robust benchmarks. Through this paper, we provide and analyse an automated framework to obtain such a corpus for Indian language neural machine translation (NMT) systems. Our pipeline consists of a baseline NMT system, a retrieval module, and an alignment module that is used to work with publicly available websites such as press releases by the government. The main contribution of this effort is an incremental method that uses the above pipeline to iteratively grow the corpus while also improving each of the components of our system. Through our work, we also evaluate design choices such as the choice of pivoting language and the effect of iteratively and incrementally increasing the corpus size. In addition to providing an automated framework, our work also yields a corpus that is relatively larger than existing corpora available for Indian languages. This corpus helps us obtain substantially improved results on the publicly available WAT evaluation benchmark and other standard evaluation benchmarks.

3. LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for Multi-Granular Propaganda Span Identification [PDF] Back to Contents
  Sopan Khosla, Rishabh Joshi, Ritam Dutt, Alan W Black, Yulia Tsvetkov
Abstract: In this paper we describe our submission for the task of Propaganda Span Identification in news articles. We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within a sentence are indicative of propaganda. The "multi-granular" model incorporates linguistic knowledge at various levels of text granularity, including word-, sentence- and document-level syntactic, semantic and pragmatic affect features, which significantly improve model performance compared to its language-agnostic variant. To facilitate better representation learning, we also collect a corpus of 10k news articles and use it for fine-tuning the model. The final model is a majority-voting ensemble which learns different propaganda class boundaries by leveraging different subsets of the incorporated knowledge. Our final ensemble attains 4th position on the test leaderboard. Our final model and code are released at this https URL.
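The architecture described above can be sketched in a few lines of PyTorch: a pre-trained BERT encoder followed by a BiLSTM and a per-token classification head. The class below and its hyperparameters are illustrative assumptions, not the authors' released model.

```python
# Minimal BERT -> BiLSTM -> per-token classifier sketch (illustration only).
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBiLSTMTagger(nn.Module):
    def __init__(self, model_name="bert-base-cased", hidden=256, n_labels=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_labels)    # propaganda vs. not

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(h)                            # contextualise further
        return self.head(h)                            # [batch, seq, n_labels]

tok = AutoTokenizer.from_pretrained("bert-base-cased")
batch = tok(["They are destroying our great country!"], return_tensors="pt")
logits = BertBiLSTMTagger()(batch["input_ids"], batch["attention_mask"])
```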

4. Hybrid Ranking Network for Text-to-SQL [PDF] Back to Contents
  Qin Lyu, Kaushik Chakrabarti, Shobhit Hathi, Souvik Kundu, Jianwen Zhang, Zheng Chen
Abstract: In this paper, we study how to leverage pre-trained language models in Text-to-SQL. We argue that previous approaches underutilize the base language models by concatenating all columns together with the NL question and feeding them into the base language model in the encoding stage. We propose a neat approach called Hybrid Ranking Network (HydraNet), which breaks the problem down into column-wise ranking and decoding, and finally assembles the column-wise outputs into a SQL query by straightforward rules. In this approach, the encoder is given an NL question and one individual column, which perfectly aligns with the original tasks BERT/RoBERTa are trained on, and hence we avoid any ad-hoc pooling or additional encoding layers that are necessary in prior approaches. Experiments on the WikiSQL dataset show that the proposed approach is very effective, achieving the top place on the leaderboard.
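The column-wise idea can be illustrated with a short sketch: each column is paired with the NL question, encoded independently, and scored. The names and the untrained scoring head below are hypothetical, not HydraNet's actual code.

```python
# Column-wise ranking sketch: score (question, column) pairs independently.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
enc = AutoModel.from_pretrained("roberta-base")
score_head = nn.Linear(enc.config.hidden_size, 1)      # untrained, for shape only

question = "How many singers are from France?"
columns = ["singer.name", "singer.country", "singer.age"]

scores = []
with torch.no_grad():
    for col in columns:
        inputs = tok(question, col, return_tensors="pt")    # one column per input
        cls = enc(**inputs).last_hidden_state[:, 0]         # sentence-level vector
        scores.append(score_head(cls).item())
ranked = sorted(zip(columns, scores), key=lambda x: -x[1])  # top columns feed the SQL rules
```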

5. A Neural Generative Model for Joint Learning Topics and Topic-Specific Word Embeddings [PDF] Back to Contents
  Lixing Zhu, Yulan He, Deyu Zhou
Abstract: We propose a novel generative model to explore both local and global context for joint learning topics and topic-specific word embeddings. In particular, we assume that global latent topics are shared across documents, a word is generated by a hidden semantic vector encoding its contextual semantic meaning, and its context words are generated conditional on both the hidden semantic vector and global latent topics. Topics are trained jointly with the word embeddings. The trained model maps words to topic-dependent embeddings, which naturally addresses the issue of word polysemy. Experimental results show that the proposed model outperforms the word-level embedding methods in both word similarity evaluation and word sense disambiguation. Furthermore, the model also extracts more coherent topics compared with existing neural topic models or other models for joint learning of topics and word embeddings. Finally, the model can be easily integrated with existing deep contextualized word embedding learning methods to further improve the performance of downstream tasks such as sentiment classification.

6. A Comparison of Synthetic Oversampling Methods for Multi-class Text Classification [PDF] Back to Contents
  Anna Glazkova
Abstract: The authors compared oversampling methods for the problem of multi-class topic classification. The SMOTE algorithm underlies one of the most popular oversampling methods. It consists of choosing two examples of a minority class and generating a new example based on them. In the paper, the authors compared the basic SMOTE method with its two modifications (Borderline SMOTE and ADASYN) and a random oversampling technique on the example of a text classification task. The paper discusses the k-nearest neighbor algorithm, the support vector machine algorithm and three types of neural networks (feedforward network, long short-term memory (LSTM) and bidirectional LSTM). The authors combined these machine learning algorithms with different text representations and compared the synthetic oversampling methods. In most cases, the use of oversampling techniques can significantly improve the quality of classification. The authors conclude that for this task, the quality of the KNN and SVM algorithms is more influenced by class imbalance than that of the neural networks.
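For reference, the core SMOTE step is easy to state in NumPy: pick a minority-class point, pick one of its k nearest minority neighbours, and interpolate between them. This is a didactic sketch; in practice the imbalanced-learn implementations of SMOTE, Borderline SMOTE and ADASYN would be used.

```python
# Didactic SMOTE step (use imbalanced-learn's SMOTE/BorderlineSMOTE/ADASYN in practice).
import numpy as np

def smote_sample(X_minority, k=5, rng=np.random):
    i = rng.randint(len(X_minority))                 # a random minority point
    x = X_minority[i]
    d = np.linalg.norm(X_minority - x, axis=1)
    neighbours = np.argsort(d)[1:k + 1]              # k nearest, excluding itself
    x_nn = X_minority[rng.choice(neighbours)]
    gap = rng.rand()                                 # uniform in [0, 1)
    return x + gap * (x_nn - x)                      # point on the segment x -> x_nn

X_min = np.random.randn(20, 300)                     # e.g. averaged word vectors
synthetic = np.stack([smote_sample(X_min) for _ in range(80)])
```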

7. A parallel evaluation data set of software documentation with document structure annotation [PDF] Back to Contents
  Bianka Buschbeck, Miriam Exel
Abstract: This paper accompanies the software documentation data set for machine translation, a parallel evaluation data set of data originating from the SAP Help Portal, which we release to the machine translation community for research purposes. It offers the possibility to tune and evaluate machine translation systems in the domain of corporate software documentation and contributes to the availability of a wider range of evaluation scenarios. The data set comprises the language pairs English to Hindi, Indonesian, Malay and Thai, and thus also increases the test coverage for many low-resource language pairs. Unlike most evaluation data sets that consist of plain parallel text, the segments in this data set come with additional metadata that describes structural information of the document context. We provide insights into the origin and creation of the data set, as well as its particularities and characteristics.

8. Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling [PDF] Back to Contents
  Jiacheng Li, Siliang Tang, Juncheng Li, Jun Xiao, Fei Wu, Shiliang Pu, Yueting Zhuang
Abstract: Visual Storytelling (VIST) is the task of telling a narrative story about a certain topic according to a given photo stream. Existing studies focus on designing complex models, which rely on a huge amount of human-annotated data. However, the annotation of VIST is extremely costly, and many topics cannot be covered in the training dataset due to the long-tail topic distribution. In this paper, we focus on enhancing the generalization ability of the VIST model by considering the few-shot setting. Inspired by the way humans tell a story, we propose a topic adaptive storyteller to model the ability of inter-topic generalization. In practice, we apply a gradient-based meta-learning algorithm to multi-modal seq2seq models to endow the model with the ability to adapt quickly from topic to topic. Besides, we further propose a prototype encoding structure to model the ability of intra-topic derivation. Specifically, we encode and restore the few training story texts to serve as a reference to guide the generation at inference time. Experimental results show that topic adaptation and the prototype encoding structure mutually benefit the few-shot model on the BLEU and METEOR metrics. A further case study shows that the stories generated after few-shot adaptation are more relevant and expressive.
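The topic-to-topic adaptation loop can be sketched as below. The paper states only that a gradient-based meta-learning algorithm is used; this sketch uses a first-order Reptile-style update and a toy linear model as a stand-in for the multi-modal seq2seq storyteller.

```python
# First-order meta-learning sketch (Reptile-style update; toy stand-in model).
import copy
import torch
import torch.nn as nn

model = nn.Linear(16, 16)              # stand-in for the multi-modal seq2seq model
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    x, y = torch.randn(8, 16), torch.randn(8, 16)   # one sampled topic's batch
    fast = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(fast.parameters(), lr=1e-2)
    for _ in range(5):                 # inner loop: adapt quickly to this topic
        inner_opt.zero_grad()
        nn.functional.mse_loss(fast(x), y).backward()
        inner_opt.step()
    meta_opt.zero_grad()
    for p, fp in zip(model.parameters(), fast.parameters()):
        p.grad = p.data - fp.data      # outer step moves weights toward the adapted ones
    meta_opt.step()
```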

9. Can We Spot the "Fake News" Before It Was Even Written? [PDF] Back to Contents
  Preslav Nakov
Abstract: Given the recent proliferation of disinformation online, there has also been growing research interest in automatically debunking rumors, false claims, and "fake news." A number of fact-checking initiatives have been launched so far, both manual and automatic, but the whole enterprise remains in a state of crisis: by the time a claim is finally fact-checked, it could have reached millions of users, and the harm caused could hardly be undone. An arguably more promising direction is to focus on fact-checking entire news outlets, which can be done in advance. Then, we could fact-check the news before it was even written: by checking how trustworthy the outlets that published it are. We describe how we do this in the Tanbih news aggregator, which makes readers aware of what they are reading. In particular, we develop media profiles that show the general factuality of reporting, the degree of propagandistic content, hyper-partisanship, leading political ideology, general frame of reporting, and stance with respect to various claims and topics.

10. Real-Time Sign Language Detection using Human Pose Estimation [PDF] Back to Contents
  Amit Moryossef, Ioannis Tsochantaridis, Roee Aharoni, Sarah Ebling, Srini Narayanan
Abstract: We propose a lightweight real-time sign language detection model, as we identify the need for such a case in videoconferencing. We extract optical flow features based on human pose estimation and, using a linear classifier, show these features are meaningful with an accuracy of 80%, evaluated on the DGS Corpus. Using a recurrent model directly on the input, we see improvements of up to 91% accuracy, while still working under 4ms. We describe a demo application to sign language detection in the browser in order to demonstrate its usage possibility in videoconferencing applications.
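A stripped-down version of this pipeline fits in a few lines: frame-to-frame pose-keypoint displacements serve as the optical-flow-style feature, and a linear classifier decides signing vs. not. The data below is synthetic and only illustrates the shapes and flow, not the DGS Corpus setup.

```python
# Toy sketch: pose-delta motion feature + linear classifier (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression

def flow_features(poses):
    """poses: [frames, keypoints, 2] -> per-frame mean keypoint displacement."""
    deltas = np.diff(poses, axis=0)                     # motion between frames
    return np.linalg.norm(deltas, axis=2).mean(axis=1, keepdims=True)

signing = np.cumsum(np.random.randn(50, 25, 2), axis=0)    # moving keypoints
idle = signing[:1] + 0.01 * np.random.randn(50, 25, 2)     # nearly static frames
X = np.vstack([flow_features(signing), flow_features(idle)])
y = np.array([1] * 49 + [0] * 49)                          # np.diff drops one frame
clf = LogisticRegression().fit(X, y)                       # signing-activity detector
```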

11. Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN [PDF] Back to Contents
  Zongyang Du, Kun Zhou, Berrak Sisman, Haizhou Li
Abstract: Cross-lingual voice conversion aims to change the source speaker's voice to sound like that of the target speaker when the source and target speakers speak different languages. It relies on non-parallel training data from two different languages and is hence more challenging than mono-lingual voice conversion. Previous studies on cross-lingual voice conversion mainly focus on spectral conversion with a linear transformation for F0 transfer. However, as an important prosodic factor, F0 is inherently hierarchical, so it is insufficient to just use a linear method for conversion. We propose the use of continuous wavelet transform (CWT) decomposition for F0 modeling. CWT provides a way to decompose a signal into different temporal scales that explain prosody at different time resolutions. We also propose to train two CycleGAN pipelines, for spectrum and prosody mapping respectively. In this way, we eliminate the need for parallel data of any two languages and for any alignment techniques. Experimental results show that our proposed Spectrum-Prosody-CycleGAN framework outperforms the Spectrum-CycleGAN baseline in subjective evaluation. To our best knowledge, this is the first study of prosody in cross-lingual voice conversion.
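The CWT decomposition of an F0 contour can be illustrated as follows. The Mexican-hat wavelet and ten dyadic scales are common choices in CWT-based prosody work, but the paper's exact configuration may differ.

```python
# CWT decomposition of a toy F0 contour into temporal scales (PyWavelets).
import numpy as np
import pywt

f0 = 120 + 20 * np.sin(np.linspace(0, 6 * np.pi, 500))    # toy F0 contour (Hz)
f0_norm = (f0 - f0.mean()) / f0.std()                      # normalise before CWT
scales = 2 ** np.arange(1, 11)                             # 10 dyadic temporal scales
coeffs, freqs = pywt.cwt(f0_norm, scales, "mexh")          # coeffs: [10, 500]
# Each row of coeffs describes F0 movement at one time resolution,
# which is the representation a prosody-mapping CycleGAN would operate on.
```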

12. DensE: An Enhanced Non-Abelian Group Representation for Knowledge Graph Embedding [PDF] Back to Contents
  Haonan Lu, Hailin Hu
Abstract: Capturing the composition patterns of relations is a vital task in knowledge graph completion. It also serves as a fundamental step towards multi-hop reasoning over learned knowledge. Previously, rotation-based translational methods, e.g., RotatE, have been developed to model composite relations using the product of a series of complex-valued diagonal matrices. However, RotatE makes several oversimplified assumptions on the composition patterns, forcing the relations to be commutative, independent from entities and fixed in scale. To tackle this problem, we have developed a novel knowledge graph embedding method, named DensE, to provide sufficient modeling capacity for complex composition patterns. In particular, our method decomposes each relation into an SO(3) group-based rotation operator and a scaling operator in the three dimensional (3-D) Euclidean space. The advantages of our method are twofold: (1) For composite relations, the corresponding diagonal relation matrices can be non-commutative and related with entity embeddings; (2) It extends the concept of RotatE to a more expressive setting with lower model complexity and preserves the direct geometrical interpretations, which reveals how relations with distinct patterns (i.e., symmetry/anti-symmetry, inversion and composition) are modeled. Experimental results on multiple benchmark knowledge graphs show that DensE outperforms the current state-of-the-art models for missing link prediction, especially on composite relations.
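The relation operator described above (an SO(3) rotation composed with a scaling) can be written down directly. The 3-D block structure and the distance-based score below are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative DensE-style operator: rotate and scale 3-D blocks of an embedding.
import numpy as np
from scipy.spatial.transform import Rotation

def apply_relation(head, euler_angles, scale):
    """head: [d] with d divisible by 3; each 3-D block is rotated and scaled."""
    R = Rotation.from_euler("xyz", euler_angles).as_matrix()   # an SO(3) element
    return (scale * head.reshape(-1, 3) @ R.T).reshape(-1)

def score(head, relation, tail):
    angles, s = relation
    return -np.linalg.norm(apply_relation(head, angles, s) - tail)

h, t = np.random.randn(30), np.random.randn(30)
print(score(h, ([0.3, -0.1, 1.2], 0.8), t))     # higher = more plausible triple
# Unlike complex diagonal rotations, two SO(3) rotations need not commute,
# which is what lets composite relations be non-commutative.
```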

13. Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings [PDF] Back to Contents
  Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka
Abstract: Recently, an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed as a joint model of speaker counting, speech recognition and speaker identification for monaural overlapped speech. It showed promising results for simulated speech mixtures consisting of various numbers of speakers. However, the model required prior knowledge of speaker profiles to perform speaker identification, which significantly limited the application of the model. In this paper, we extend the prior work by addressing the case where no speaker profile is available. Specifically, we perform speaker counting and clustering by using the internal speaker representations of the E2E SA-ASR model to diarize the utterances of the speakers whose profiles are missing from the speaker inventory. We also propose a simple modification to the reference labels of the E2E SA-ASR training which helps handle continuous multi-talker recordings well. We conduct a comprehensive investigation of the original E2E SA-ASR and the proposed method on the monaural LibriCSS dataset. Compared to the original E2E SA-ASR with relevant speaker profiles, the proposed method achieves a close performance without any prior speaker knowledge. We also show that the source-target attention in the E2E SA-ASR model provides information about the start and end times of the hypotheses.
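The profile-free fallback boils down to clustering the model's internal speaker representations. Here is a minimal sketch with toy embeddings; the clustering method and threshold are assumptions, not necessarily the authors' choices.

```python
# Toy sketch: speaker counting/clustering over internal speaker embeddings.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# three synthetic "speakers", ten utterance-level embeddings each
utt_embs = np.vstack([np.random.randn(10, 128) + c for c in (0.0, 5.0, -5.0)])
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=30.0, linkage="average"
).fit_predict(utt_embs)
print("estimated number of speakers:", labels.max() + 1)   # counting via the threshold
```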

14. Context Reinforced Neural Topic Modeling over Short Texts [PDF] Back to Contents
  Jiachun Feng, Zusheng Zhang, Cheng Ding, Yanghui Rao, Haoran Xie
Abstract: As one of the prevalent topic mining tools, neural topic modeling has attracted a lot of interest for its advantages of high training efficiency and strong generalisation ability. However, due to the lack of context in each short text, the existing neural topic models may suffer from feature sparsity on such documents. To alleviate this issue, we propose a Context Reinforced Neural Topic Model (CRNTM), whose characteristics can be summarized as follows. Firstly, by assuming that each short text covers only a few salient topics, CRNTM infers the topic for each word within a narrow range. Secondly, our model exploits pre-trained word embeddings by treating topics as multivariate Gaussian distributions or Gaussian mixture distributions in the embedding space. Extensive experiments on two benchmark datasets validate the effectiveness of the proposed model on both topic discovery and text classification.
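The second ingredient above has a simple reading: a topic is a Gaussian in the word-embedding space, and its word distribution comes from evaluating that density at each vocabulary embedding. Below is a sketch of that computation; the exact parameterisation in CRNTM may differ.

```python
# Gaussian topic over pre-trained word embeddings (assumed form, for illustration).
import numpy as np
from scipy.stats import multivariate_normal

vocab_embs = np.random.randn(1000, 50)          # stand-in pre-trained embeddings
mu, cov = np.zeros(50), np.eye(50)              # one topic's Gaussian parameters
log_dens = multivariate_normal(mu, cov).logpdf(vocab_embs)
p_w_given_topic = np.exp(log_dens - log_dens.max())
p_w_given_topic /= p_w_given_topic.sum()        # normalise over the vocabulary
top_words = np.argsort(-p_w_given_topic)[:10]   # the topic's most probable words
```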

15. Neural PLDA Modeling for End-to-End Speaker Verification [PDF] Back to Contents
  Shreyas Ramoji, Prashant Krishnan, Sriram Ganapathy
Abstract: While deep learning models have made significant advances in supervised classification problems, the application of these models to out-of-set verification tasks like speaker recognition has been limited to deriving feature embeddings. The state-of-the-art x-vector PLDA based speaker verification systems use a generative model based on probabilistic linear discriminant analysis (PLDA) for computing the verification score. Recently, we proposed a neural network approach for backend modeling in speaker verification, called neural PLDA (NPLDA), where the likelihood ratio score of the generative PLDA model is posed as a discriminative similarity function and the learnable parameters of the score function are optimized using a verification cost. In this paper, we extend this work to achieve joint optimization of the embedding neural network (x-vector network) with the NPLDA network in an end-to-end (E2E) fashion. The proposed end-to-end model is optimized directly from the acoustic features with a verification cost function, and during testing the model directly outputs the likelihood ratio score. With various experiments using the NIST speaker recognition evaluation (SRE) 2018 and 2019 datasets, we show that the proposed E2E model improves significantly over the x-vector PLDA baseline speaker verification system.
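A compact way to see the NPLDA idea: keep the quadratic form of the PLDA log-likelihood ratio but make its matrices learnable and train them with a verification loss. The parameterisation below follows the generic PLDA LLR form and is a sketch, not necessarily the authors' exact formulation.

```python
# Discriminative PLDA-style scorer: learnable quadratic form + verification loss.
import torch
import torch.nn as nn

class NeuralPLDA(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.P = nn.Parameter(0.01 * torch.randn(dim, dim))   # cross-trial term
        self.Q = nn.Parameter(0.01 * torch.randn(dim, dim))   # within-trial term
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x1, x2):
        cross = (x1 @ self.P * x2).sum(-1)
        quad = (x1 @ self.Q * x1).sum(-1) + (x2 @ self.Q * x2).sum(-1)
        return cross + quad + self.b          # LLR-style verification score

scorer = NeuralPLDA(256)
x1, x2 = torch.randn(4, 256), torch.randn(4, 256)     # enrolment/test x-vectors
loss = nn.functional.binary_cross_entropy_with_logits(
    scorer(x1, x2), torch.tensor([1., 0., 1., 0.]))   # target vs. non-target trials
```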

16. On Learning Language-Invariant Representations for Universal Machine Translation [PDF] Back to Contents
  Han Zhao, Junjie Hu, Andrej Risteski
Abstract: The goal of universal machine translation is to learn to translate between any pair of languages, given a corpus of paired translated documents for a small subset of all pairs of languages. Despite impressive empirical results and an increasing interest in massively multilingual models, theoretical analysis on translation errors made by such universal machine translation models is only nascent. In this paper, we formally prove certain impossibilities of this endeavour in general, as well as prove positive results in the presence of additional (but natural) structure of data. For the former, we derive a lower bound on the translation error in the many-to-many translation setting, which shows that any algorithm aiming to learn shared sentence representations among multiple language pairs has to make a large translation error on at least one of the translation tasks, if no assumption on the structure of the languages is made. For the latter, we show that if the paired documents in the corpus follow a natural encoder-decoder generative process, we can expect a natural notion of "generalization": a linear number of language pairs, rather than quadratic, suffices to learn a good representation. Our theory also explains what kinds of connection graphs between pairs of languages are better suited: ones with longer paths result in worse sample complexity in terms of the total number of documents per language pair needed. We believe our theoretical insights and implications contribute to the future algorithmic design of universal machine translation.

17. Transformer with Bidirectional Decoder for Speech Recognition [PDF] Back to Contents
  Xi Chen, Songyang Zhang, Dandan Song, Peng Ouyang, Shouyi Yin
Abstract: Attention-based models have made tremendous progress on end-to-end automatic speech recognition (ASR) recently. However, the conventional transformer-based approaches usually generate the sequence results token by token from left to right, leaving the right-to-left contexts unexploited. In this work, we introduce a bidirectional speech transformer to utilize the different directional contexts simultaneously. Specifically, the outputs of our proposed transformer include a left-to-right target and a right-to-left target. In the inference stage, we use the introduced bidirectional beam search method, which can generate not only left-to-right candidates but also right-to-left candidates, and determines the best hypothesis by the score. To demonstrate our proposed speech transformer with a bidirectional decoder (STBD), we conduct extensive experiments on the AISHELL-1 dataset. The results show that STBD achieves a 3.6% relative CER reduction (CERR) over the unidirectional speech transformer baseline. Besides, the strongest model in this paper, called STBD-Big, achieves 6.64% CER on the test set, without language model rescoring or any extra data augmentation strategies.
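The decision rule at inference time is easy to state: decode candidates in both directions, map right-to-left hypotheses back to normal order, and keep the hypothesis with the best (here length-normalised) score. A toy sketch:

```python
# Toy bidirectional-decoding decision rule (length-normalised log-probability).
def pick_best(l2r_candidates, r2l_candidates):
    """Each candidate is (tokens, log_prob); r2l token lists are reversed back."""
    pool = list(l2r_candidates)
    pool += [(toks[::-1], lp) for toks, lp in r2l_candidates]
    return max(pool, key=lambda c: c[1] / max(len(c[0]), 1))

l2r = [(["how", "are", "you"], -1.2)]
r2l = [(["you", "are", "how"], -0.9)]        # right-to-left order as decoded
print(pick_best(l2r, r2l))                   # -> (['how', 'are', 'you'], -0.9)
```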
