
[arXiv Papers] Computation and Language 2020-07-02

Contents

1. Knowledge-Aware Language Model Pretraining [PDF] Abstract
2. COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation [PDF] Abstract
3. Towards User Friendly Medication Mapping Using Entity-Boosted Two-Tower Neural Network [PDF] Abstract
4. Iterative Paraphrastic Augmentation with Discriminative Span Alignment [PDF] Abstract
5. Latent Compositional Representations Improve Systematic Generalization in Grounded Question Answering [PDF] Abstract
6. So What's the Plan? Mining Strategic Planning Document [PDF] Abstract
7. SemEval-2020 Task 4: Commonsense Validation and Explanation [PDF] Abstract
8. Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation [PDF] Abstract
9. Transferability of Natural Language Inference to Biomedical Question Answering [PDF] Abstract
10. Adversarial Mutual Information for Text Generation [PDF] Abstract
11. OSCaR: Orthogonal Subspace Correction and Rectification of Biases in Word Embeddings [PDF] Abstract
12. Unsupervised Semantic Hashing with Pairwise Reconstruction [PDF] Abstract
13. Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings [PDF] Abstract
14. Modality-Agnostic Attention Fusion for visual search with text feedback [PDF] Abstract
15. Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition [PDF] Abstract

Abstracts

1. Knowledge-Aware Language Model Pretraining [PDF] Back to Contents
  Corby Rosset, Chenyan Xiong, Minh Phan, Xia Song, Paul Bennett, Saurabh Tiwary
Abstract: How much knowledge do pretrained language models hold? Recent research observed that pretrained transformers are adept at modeling semantics, but it is unclear to what degree they grasp human knowledge, or how to ensure they do so. In this paper we incorporate knowledge-awareness in language model pretraining without changing the transformer architecture, inserting explicit knowledge layers, or adding external storage of semantic information. Rather, we simply signal the existence of entities to the input of the transformer in pretraining, with an entity-extended tokenizer; and at the output, with an additional entity prediction task. Our experiments show that solely by adding these entity signals in pretraining, significantly more knowledge is packed into the transformer parameters: we observe improved language modeling accuracy, factual correctness in LAMA knowledge probing tasks, and semantics in the hidden representations through edge probing. We also show that our knowledge-aware language model (KALM) can serve as a drop-in replacement for GPT-2 models, significantly improving downstream tasks like zero-shot question-answering with no task-related training.
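
A minimal sketch of how the two entity signals could be wired in, assuming a PyTorch setup: the transformer body is unchanged, and an auxiliary entity-prediction head is trained next to the usual LM head (the input-side signal would come from the entity-extended tokenizer). The class name and the `alpha` mixing weight are assumptions, not the paper's specification.

```python
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeAwareHeads(nn.Module):
    """Hypothetical dual output head: next-token prediction plus an
    auxiliary entity-prediction task over the same hidden states."""
    def __init__(self, hidden_size, vocab_size, num_entities):
        super().__init__()
        self.lm_head = nn.Linear(hidden_size, vocab_size)
        self.entity_head = nn.Linear(hidden_size, num_entities)

    def forward(self, hidden_states, token_targets, entity_targets, alpha=0.5):
        # hidden_states: (batch, seq, hidden) from an *unmodified* transformer
        lm_loss = F.cross_entropy(
            self.lm_head(hidden_states).transpose(1, 2), token_targets)
        entity_loss = F.cross_entropy(
            self.entity_head(hidden_states).transpose(1, 2), entity_targets)
        return lm_loss + alpha * entity_loss  # alpha: assumed mixing weight
```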

2. COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation [PDF] Back to Contents
  Qingyun Wang, Manling Li, Xuan Wang, Nikolaus Parulian, Guangxing Han, Jiawei Ma, Jingxuan Tu, Ying Lin, Haoran Zhang, Weili Liu, Aabhas Chauhan, Yingjun Guan, Bangzheng Li, Ruisong Li, Xiangchen Song, Heng Ji, Jiawei Han, Shih-Fu Chang, James Pustejovsky, David Liem, Ahmed Elsayed, Martha Palmer, Jasmine Rah, Cynthia Schneider, Boyan Onyshkevych
Abstract: To combat COVID-19, clinicians and scientists all need to digest the vast amount of relevant biomedical knowledge in literature to understand the disease mechanism and the related biological functions. We have developed a novel and comprehensive knowledge discovery framework, COVID-KG, which leverages novel semantic representation and external ontologies to represent text and images in the input literature data, and then performs various extraction components to extract fine-grained multimedia knowledge elements (entities, relations and events). We then exploit the constructed multimedia KGs for question answering and report generation, using drug repurposing as a case study. Our framework also provides detailed contextual sentences, subfigures and knowledge subgraphs as evidence. All of the data, KGs, resources, and shared services are publicly available.

3. Towards User Friendly Medication Mapping Using Entity-Boosted Two-Tower Neural Network [PDF] Back to Contents
  Shaoqing Yuan, Parminder Bhatia, Busra Celikkaya, Haiyang Liu, Kyunghwan Choi
Abstract: Recent advancements in medical entity linking have been applied in the area of scientific literature and social media data. However, with the adoption of telemedicine and conversational agents such as Alexa in healthcare settings, medical name inference has become an important task. Medication name inference is the task of mapping user friendly medication names from a free-form text to a concept in a normalized medication list. This is challenging due to the differences in the use of medical terminology from health care professionals and user conversations coming from the lay public. We begin with mapping descriptive medication phrases (DMP) to standard medication names (SMN). Given the prescriptions of each patient, we want to provide them with the flexibility of referring to the medication in their preferred ways. We approach this as a ranking problem which maps SMN to DMP by ordering the list of medications in the patient's prescription list obtained from pharmacies. Furthermore, we leveraged the output of intermediate layers and performed medication clustering. We present the Medication Inference Model (MIM) achieving state-of-the-art results. By incorporating medical entities based attention, we have obtained further improvement for ranking models.
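
A hedged sketch of the two-tower ranking setup: both sides are embedded independently and candidates are ordered by cosine similarity. The tower encoders are placeholders for whatever text encoders are used, and cosine scoring is one standard choice, not necessarily the paper's exact scoring function.

```python
import torch.nn.functional as F

def rank_medications(dmp_tower, smn_tower, dmp_text, smn_candidates):
    """Order SMN candidates for one DMP by cosine similarity between
    independently computed tower embeddings (placeholder encoders)."""
    q = F.normalize(dmp_tower(dmp_text), dim=-1)        # (1, d)
    c = F.normalize(smn_tower(smn_candidates), dim=-1)  # (n, d)
    scores = (c @ q.t()).squeeze(-1)                    # (n,) cosine scores
    order = scores.argsort(descending=True).tolist()
    return [smn_candidates[i] for i in order]
```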

4. Iterative Paraphrastic Augmentation with Discriminative Span Alignment [PDF] Back to Contents
  Ryan Culkin, J. Edward Hu, Elias Stengel-Eskin, Guanghui Qin, Benjamin Van Durme
Abstract: We introduce a novel paraphrastic augmentation strategy based on sentence-level lexically constrained paraphrasing and discriminative span alignment. Our approach allows for the large-scale expansion of existing resources, or the rapid creation of new resources from a small, manually-produced seed corpus. We illustrate our framework on the Berkeley FrameNet Project, a large-scale language understanding effort spanning more than two decades of human labor. Based on roughly four days of collecting training data for the alignment model and approximately one day of parallel compute, we automatically generate 495,300 unique (Frame, Trigger) combinations annotated in context, a roughly 50x expansion atop FrameNet v1.7.
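
The iterative loop can be sketched as follows, with `paraphraser` and `span_aligner` as hypothetical stand-ins for the paper's lexically constrained paraphraser and discriminative span aligner.

```python
def iterative_augment(seed_corpus, paraphraser, span_aligner, rounds=2):
    """Grow an annotated corpus by paraphrasing and projecting span labels.
    Items are (sentence, [(start, end, label), ...]) pairs."""
    corpus = list(seed_corpus)
    for _ in range(rounds):
        new_items = []
        for sentence, spans in corpus:
            paraphrase = paraphraser(sentence)  # lexically constrained output
            projected = [(*span_aligner(sentence, paraphrase, (s, e)), label)
                         for s, e, label in spans]  # map each span across
            new_items.append((paraphrase, projected))
        corpus.extend(new_items)  # augmented items feed the next round
    return corpus
```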

5. Latent Compositional Representations Improve Systematic Generalization in Grounded Question Answering [PDF] Back to Contents
  Ben Bogin, Sanjay Subramanian, Matt Gardner, Jonathan Berant
Abstract: Answering questions that involve multi-step reasoning requires decomposing them and using the answers of intermediate steps to reach the final answer. However, state-of-the-art models in grounded question answering often do not explicitly perform decomposition, leading to difficulties in generalization to out-of-distribution examples. In this work, we propose a model that computes a representation and denotation for all question spans in a bottom-up, compositional manner using a CKY-style parser. Our model effectively induces latent trees, driven by end-to-end (the answer) supervision only. We show that this inductive bias towards tree structures dramatically improves systematic generalization to out-of-distribution examples compared to strong baselines on an arithmetic expressions benchmark as well as on CLOSURE, a dataset that focuses on systematic generalization of models for grounded question answering. On this challenging dataset, our model reaches an accuracy of 92.8%, significantly higher than prior models that almost perfectly solve the task on a random, in-distribution split.
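
A hedged sketch of the bottom-up chart computation: every span's representation is a soft mixture over its split points, which gives the latent-tree flavor described above. The `compose` function, assumed here to map two adjacent span vectors to a (vector, score) pair, is a placeholder for the model's learned composition module.

```python
import torch

def span_chart(token_vecs, compose):
    """Compute a representation for every span bottom-up, CKY style."""
    n = token_vecs.size(0)
    chart = {(i, i + 1): token_vecs[i] for i in range(n)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            pairs = [compose(chart[(i, k)], chart[(k, j)])
                     for k in range(i + 1, j)]
            vecs = torch.stack([v for v, _ in pairs])          # (m, d)
            weights = torch.softmax(torch.stack([s for _, s in pairs]), dim=0)
            chart[(i, j)] = (weights.unsqueeze(-1) * vecs).sum(dim=0)
    return chart  # chart[(0, n)] covers the whole question
```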

6. So What's the Plan? Mining Strategic Planning Document [PDF] Back to Contents
  Ekaterina Artemova, Tatiana Batura, Anna Golenkovskaya, Vitaly Ivanin, Vladimir Ivanov, Veronika Sarkisyan, Ivan Smurov, Elena Tutubalina
Abstract: In this paper we present a corpus of Russian strategic planning documents, RuREBus. This project is grounded in both language technology and e-government perspectives. Not only are new language sources and tools being developed, but also their applications to e-government research. We demonstrate the pipeline for creating a text corpus from scratch. First, the annotation schema is designed. Next, texts are marked up using a human-in-the-loop strategy, so that preliminary annotations are derived from a machine learning model and then manually corrected. The amount of annotated text is large enough to showcase what insights can be gained from RuREBus.

7. SemEval-2020 Task 4: Commonsense Validation and Explanation [PDF] Back to Contents
  Cunxiang Wang, Shuailong Liang, Yili Jin, Yilong Wang, Xiaodan Zhu, Yue Zhang
Abstract: In this paper, we present SemEval-2020 Task 4, Commonsense Validation and Explanation (ComVE), which includes three subtasks, aiming to evaluate whether a system can distinguish a natural language statement that makes sense to humans from one that does not, and provide the reasons. Specifically, in our first subtask, the participating systems are required to choose, from two natural language statements of similar wording, the one that makes sense and the one that does not. The second subtask additionally asks a system to select, from three options, the key reason why a given statement does not make sense. In the third subtask, a participating system needs to generate the reason automatically.
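
To make the three subtasks concrete, here is an illustrative instance, paraphrased in the spirit of the task's well-known running example (not copied from the dataset):

```python
instance = {
    "subtask_A": [                       # pick the statement that makes sense
        "He put a turkey into the fridge",
        "He put an elephant into the fridge",
    ],
    "subtask_B": {                       # pick the key reason it is nonsensical
        "statement": "He put an elephant into the fridge",
        "options": [
            "An elephant is much bigger than a fridge",   # key reason
            "Elephants are usually gray",
            "An elephant cannot eat a fridge",
        ],
    },
    "subtask_C": "generate the reason as free text",
}
```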

8. Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation [PDF] Back to Contents
  Wanrong Zhu, Xin Wang, Tsu-Jui Fu, An Yan, Pradyumna Narayana, Kazoo Sone, Sugato Basu, William Yang Wang
Abstract: In the vision-and-language navigation (VLN) task, an agent follows natural language instructions and navigates in visual environments. Compared to the indoor navigation task that has been broadly studied, navigation in real-life outdoor environments remains a significant challenge with its complicated visual inputs and an insufficient amount of instructions that illustrate the intricate urban scenes. In this paper, we introduce a Multimodal Text Style Transfer (MTST) learning approach to mitigate the problem of data scarcity in outdoor navigation tasks by effectively leveraging external multimodal resources. We first enrich the navigation data by transferring the style of the instructions generated by Google Maps API, then pre-train the navigator with the augmented external outdoor navigation dataset. Experimental results show that our MTST learning approach is model-agnostic, and our MTST approach significantly outperforms the baseline models on the outdoor VLN task, improving the task completion rate by 22% relative on the test set and achieving new state-of-the-art performance.

9. Transferability of Natural Language Inference to Biomedical Question Answering [PDF] Back to Contents
  Minbyul Jeong, Mujeen Sung, Gangwoo Kim, Donghyeon Kim, Wonjin Yoon, Jaehyo Yoo, Jaewoo Kang
Abstract: Biomedical question answering (QA) is a challenging problem due to the scarcity of data and the requirement of domain expertise. Growing interest in using pre-trained language models with transfer learning addresses the issue to some extent. Recently, learning the linguistic knowledge of entailment in sentence pairs has been shown to enhance performance in general-domain QA by leveraging such transferability between the two tasks. In this paper, we focus on facilitating this transferability by unifying the experimental setup from natural language inference (NLI) to biomedical QA. We observe that transferring from entailment data shows effective performance on Yes/No (+5.59%), Factoid (+0.53%), and List (+13.58%) type questions compared to previous challenge reports (BioASQ 7B Phase B). We also observe that our method generally performs well in the 8th BioASQ Challenge (Phase B). For sequential transfer learning, the order in which tasks are fine-tuned is important. In factoid- and list-type questions, we thoroughly analyze an intrinsic limitation of the extractive QA setting when these questions are converted to the same format as the Stanford Question Answering Dataset (SQuAD).

10. Adversarial Mutual Information for Text Generation [PDF] Back to Contents
  Boyuan Pan, Yazheng Yang, Kaizhao Liang, Bhavya Kailkhura, Zhongming Jin, Xian-Sheng Hua, Deng Cai, Bo Li
Abstract: Recent advances in maximizing mutual information (MI) between the source and target have demonstrated its effectiveness in text generation. However, previous works paid little attention to modeling the backward network of MI (i.e., dependency from the target to the source), which is crucial to the tightness of the variational information maximization lower bound. In this paper, we propose Adversarial Mutual Information (AMI): a text generation framework which is formed as a novel saddle point (min-max) optimization aiming to identify joint interactions between the source and target. Within this framework, the forward and backward networks are able to iteratively promote or demote each other's generated instances by comparing the real and synthetic data distributions. We also develop a latent noise sampling strategy that leverages random variations at the high-level semantic space to enhance the long term dependency in the generation process. Extensive experiments based on different text generation tasks demonstrate that the proposed AMI framework can significantly outperform several strong baselines, and we also show that AMI has potential to lead to a tighter lower bound of maximum mutual information for the variational information maximization problem.
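
The saddle-point setup can be pictured as two seq2seq networks taking turns. Below is a schematic sketch only: the `fwd`/`bwd` objects, their `log_prob`/`sample` methods, and the loss shape are assumptions for illustration, not the paper's exact MI estimator.

```python
def adversarial_mi_step(fwd, bwd, src, tgt, opt_fwd, opt_bwd):
    """One schematic min-max update over a forward (source -> target)
    and backward (target -> source) network."""
    # forward update: promote real pairs, demote targets paired with
    # synthetic sources drawn from the backward network
    loss_fwd = (-fwd.log_prob(tgt, given=src)
                + fwd.log_prob(tgt, given=bwd.sample(tgt)))
    opt_fwd.zero_grad()
    loss_fwd.backward()
    opt_fwd.step()
    # backward update: the mirror image, with source/target roles swapped
    loss_bwd = (-bwd.log_prob(src, given=tgt)
                + bwd.log_prob(src, given=fwd.sample(src)))
    opt_bwd.zero_grad()
    loss_bwd.backward()
    opt_bwd.step()
```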

11. OSCaR: Orthogonal Subspace Correction and Rectification of Biases in Word Embeddings [PDF] Back to Contents
  Sunipa Dev, Tao Li, Jeff M Phillips, Vivek Srikumar
Abstract: Language representations are known to carry stereotypical biases and, as a result, lead to biased predictions in downstream tasks. While existing methods are effective at mitigating biases by linear projection, such methods are too aggressive: they not only remove bias, but also erase valuable information from word embeddings. We develop new measures for evaluating specific information retention that demonstrate the tradeoff between bias removal and information retention. To address this challenge, we propose OSCaR (Orthogonal Subspace Correction and Rectification), a bias-mitigating method that focuses on disentangling biased associations between concepts instead of removing concepts wholesale. Our experiments on gender biases show that OSCaR is a well-balanced approach that ensures that semantic information is retained in the embeddings and bias is also effectively mitigated.
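
The paper's method applies a graded rotation to every embedding; the sketch below shows only the simplest version of the underlying idea, under that simplification: make the informative concept direction orthogonal to the bias direction instead of deleting either one.

```python
import numpy as np

def orthogonalize(v_bias, v_info):
    """Simplified correction: keep both concept directions, but remove the
    component of the informative direction (e.g. occupation) that lies
    along the bias direction (e.g. gender)."""
    v_bias = v_bias / np.linalg.norm(v_bias)
    v_corrected = v_info - (v_info @ v_bias) * v_bias  # drop shared component
    return v_corrected / np.linalg.norm(v_corrected)
```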

12. Unsupervised Semantic Hashing with Pairwise Reconstruction [PDF] Back to Contents
  Casper Hansen, Christian Hansen, Jakob Grue Simonsen, Stephen Alstrup, Christina Lioma
Abstract: Semantic Hashing is a popular family of methods for efficient similarity search in large-scale datasets. In Semantic Hashing, documents are encoded as short binary vectors (i.e., hash codes), such that semantic similarity can be efficiently computed using the Hamming distance. Recent state-of-the-art approaches have utilized weak supervision to train better performing hashing models. Inspired by this, we present Semantic Hashing with Pairwise Reconstruction (PairRec), which is a discrete variational autoencoder based hashing model. PairRec first encodes weakly supervised training pairs (a query document and a semantically similar document) into two hash codes, and then learns to reconstruct the same query document from both of these hash codes (i.e., pairwise reconstruction). This pairwise reconstruction enables our model to encode local neighbourhood structures within the hash code directly through the decoder. We experimentally compare PairRec to traditional and state-of-the-art approaches, and obtain significant performance improvements in the task of document similarity search.
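
The pairwise reconstruction idea compresses into a few lines. This is a hedged sketch: `encoder` and `decoder` stand in for the paper's variational components, and `decoder(code, target=...)` is assumed to return a reconstruction loss.

```python
def pairwise_reconstruction_loss(encoder, decoder, doc_query, doc_similar):
    """Both hash codes of a weakly supervised pair must reconstruct the
    *query* document, pushing neighbourhood structure into the codes."""
    code_q = encoder(doc_query)    # relaxed hash code for the query doc
    code_s = encoder(doc_similar)  # relaxed hash code for the similar doc
    return decoder(code_q, target=doc_query) + decoder(code_s, target=doc_query)
```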

13. Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings [PDF] Back to Contents
  Bowen Shi, Shane Settle, Karen Livescu
Abstract: Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the segment feature vectors defined using acoustic word embeddings. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which can be orders of magnitude larger than when using subword units like phones. We describe an efficient approach for end-to-end whole-word segmental models, with forward-backward and Viterbi decoding performed on a GPU and a simple segment scoring function that reduces space complexity. In addition, we investigate the use of pre-training via jointly trained acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) of written word labels. We find that word error rate can be reduced by a large margin by pre-training the acoustic representation with AWEs, and additional (smaller) gains can be obtained by pre-training the word prediction layer with AGWEs. Our final models improve over comparable A2W models.

14. Modality-Agnostic Attention Fusion for visual search with text feedback [PDF] Back to Contents
  Eric Dodds, Jack Culpepper, Simao Herdade, Yang Zhang, Kofi Boakye
Abstract: Image retrieval with natural language feedback offers the promise of catalog search based on fine-grained visual features that go beyond objects and binary attributes, facilitating real-world applications such as e-commerce. Our Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two visual search with modifying phrase datasets, Fashion IQ and CSS, and performs competitively on a dataset with only single-word modifications, Fashion200k. We also introduce two new challenging benchmarks adapted from Birds-to-Words and Spot-the-Diff, which provide new settings with rich language inputs, and we show that our approach without modification outperforms strong baselines. To better understand our model, we conduct detailed ablations on Fashion IQ and provide visualizations of the surprising phenomenon of words avoiding "attending" to the image region they refer to.
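
A hedged sketch of the fusion idea using standard PyTorch components: both modalities are projected into a shared space, concatenated into one token sequence, and mixed by ordinary self-attention. Dimensions and layer counts are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ModalityAgnosticFusion(nn.Module):
    """Project image-region and text-token features into one space, then
    fuse them with plain self-attention over the concatenated sequence."""
    def __init__(self, img_dim=2048, txt_dim=768, d_model=512, nhead=8):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, img_feats, txt_feats):
        # img_feats: (batch, regions, img_dim); txt_feats: (batch, tokens, txt_dim)
        tokens = torch.cat([self.img_proj(img_feats),
                            self.txt_proj(txt_feats)], dim=1)
        fused = self.fuser(tokens)   # attention mixes both modalities freely
        return fused.mean(dim=1)     # pooled embedding for retrieval
```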

15. Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition [PDF] Back to Contents
  Maarten Van Segbroeck, Harish Mallidih, Brian King, I-Fan Chen, Gurpreet Chadha, Roland Maas
Abstract: Acoustic models in real-time speech recognition systems typically stack multiple unidirectional LSTM layers to process the acoustic frames over time. Performance improvements over vanilla LSTM architectures have been reported by prepending a stack of frequency-LSTM (FLSTM) layers to the time LSTM. These FLSTM layers can learn a more robust input feature for the time LSTM layers by modeling time-frequency correlations in the acoustic input signals. A drawback of FLSTM-based architectures, however, is that they operate at a predefined, and tuned, window size and stride, referred to as a 'view' in this paper. We present a simple and efficient modification that combines the outputs of multiple FLSTM stacks with different views into a dimensionality-reduced feature representation. The proposed multi-view FLSTM architecture allows modeling a wider range of time-frequency correlations compared to an FLSTM model with a single view. When trained on 50K hours of English far-field speech data with CTC loss followed by sMBR sequence training, we show that the multi-view FLSTM acoustic model provides relative Word Error Rate (WER) improvements of 3-7% for different speaker and acoustic environment scenarios over an optimized single-FLSTM model, while retaining a similar computational footprint.
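
A minimal sketch of the frontend, assuming PyTorch; the set of views, hidden sizes, and the lazy projection are illustrative choices rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiViewFLSTM(nn.Module):
    """Slide an LSTM over the frequency axis with several (window, stride)
    views, concatenate the outputs, and project down before the time LSTMs."""
    def __init__(self, views=((24, 12), (32, 16)), hidden=64, out_dim=256):
        super().__init__()
        self.views = views
        self.flstms = nn.ModuleList(
            nn.LSTM(w, hidden, batch_first=True) for w, _ in views)
        self.proj = nn.LazyLinear(out_dim)  # width depends on the views

    def forward(self, frames):              # frames: (batch, num_freq_bins),
        outs = []                           # one acoustic frame per row
        for (w, s), flstm in zip(self.views, self.flstms):
            windows = frames.unfold(-1, w, s)   # (batch, num_windows, w)
            out, _ = flstm(windows)             # LSTM runs along frequency
            outs.append(out.flatten(1))
        return self.proj(torch.cat(outs, dim=-1))
```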
