Contents
4. A Tale of Two Perplexities: Sensitivity of Neural Language Models to Lexical Retrieval Deficits in Dementia of the Alzheimer's Type [PDF] Abstract
6. MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis [PDF] Abstract
9. Reference and Document Aware Semantic Evaluation Methods for Korean Language Summarization [PDF] Abstract
14. JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation [PDF] Abstract
22. Weakly-Supervised Neural Response Selection from an Ensemble of Task-Specialised Dialogue Agents [PDF] Abstract
23. Extracting Headless MWEs from Dependency Parse Trees: Parsing, Tagging, and Joint Modeling Approaches [PDF] Abstract
24. Evaluating text coherence based on the graph of the consistency of phrases to identify symptoms of schizophrenia [PDF] Abstract
25. Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture [PDF] Abstract
29. ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context [PDF] Abstract
Abstracts
1. On Exposure Bias, Hallucination and Domain Shift in Neural Machine Translation [PDF] Back to Contents
Chaojun Wang, Rico Sennrich
Abstract: The standard training algorithm in neural machine translation (NMT) suffers from exposure bias, and alternative algorithms have been proposed to mitigate this. However, the practical impact of exposure bias is under debate. In this paper, we link exposure bias to another well-known problem in NMT, namely the tendency to generate hallucinations under domain shift. In experiments on three datasets with multiple test domains, we show that exposure bias is partially to blame for hallucinations, and that training with Minimum Risk Training, which avoids exposure bias, can mitigate this. Our analysis explains why exposure bias is more problematic under domain shift, and also links exposure bias to the beam search problem, i.e. performance deterioration with increasing beam size. Our results provide a new justification for methods that reduce exposure bias: even if they do not increase performance on in-domain test sets, they can increase model robustness to domain shift.
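Minimum Risk Training, the fix the authors study, replaces the token-level cross-entropy that causes exposure bias with an expected risk over translations sampled from the model itself. Below is a minimal PyTorch sketch of that objective, assuming the common Shen et al. (2016) formulation with a smoothed renormalisation over the sampled candidates; function and variable names are illustrative, not the authors' code.

```python
import torch

def mrt_loss(log_probs, risks, alpha=0.005):
    """Expected-risk (MRT) objective over sampled translations.

    log_probs: (num_samples,) sequence log-probabilities under the model.
    risks:     (num_samples,) task costs, e.g. 1 - sentence-level BLEU.
    alpha:     smoothing exponent for the renormalised distribution.
    """
    q = torch.softmax(alpha * log_probs, dim=0)  # distribution over the samples
    return (q * risks).sum()                     # expected risk to minimise

# Toy usage: three sampled translations of one source sentence.
log_probs = torch.tensor([-12.3, -14.1, -15.8], requires_grad=True)
risks = torch.tensor([0.25, 0.40, 0.55])
mrt_loss(log_probs, risks).backward()
```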
2. Where is Linked Data in Question Answering over Linked Data? [PDF] Back to Contents
Tommaso Soru, Edgard Marx, André Valdestilhas, Diego Moussallem, Gustavo Publio, Muhammad Saleem
Abstract: We argue that "Question Answering with Knowledge Base" and "Question Answering over Linked Data" are currently two instances of the same problem, even though only one of them explicitly declares that it deals with Linked Data. We point out the lack of existing methods for evaluating question answering on datasets which exploit external links to the rest of the cloud or share a common schema. To this end, we propose the creation of new evaluation settings to leverage the advantages of the Semantic Web to achieve AI-complete question answering.
3. Learning Robust Models for e-Commerce Product Search [PDF] Back to Contents
Thanh V. Nguyen, Nikhil Rao, Karthik Subbian
Abstract: Showing items that do not match search query intent degrades customer experience in e-commerce. These mismatches result from counterfactual biases of the ranking algorithms toward noisy behavioral signals such as clicks and purchases in the search logs. Mitigating the problem requires a large labeled dataset, which is expensive and time-consuming to obtain. In this paper, we develop a deep, end-to-end model that learns to effectively classify mismatches and to generate hard mismatched examples to improve the classifier. We train the model end-to-end by introducing a latent variable into the cross-entropy loss that alternates between using the real and generated samples. This not only makes the classifier more robust but also boosts the overall ranking performance. Our model achieves a relative gain over baselines of more than 26% in F-score and more than 17% in area under the PR curve. On live search traffic, our model shows significant improvements in multiple countries.
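One way to read the latent variable described above is as a Bernoulli switch that decides, per training step, whether the classifier's cross-entropy is computed on real mismatches or on generator-produced hard mismatches. The sketch below is that reading only; `classifier`, the batch shapes, and `p_real` are hypothetical, and the paper's actual parameterisation may differ.

```python
import torch
import torch.nn.functional as F

def mixed_bce_step(classifier, real_batch, generated_batch, labels, p_real=0.5):
    # Sample the latent switch z: 1 -> train on real examples,
    # 0 -> train on generated hard mismatches.
    z = torch.bernoulli(torch.tensor(p_real))
    batch = real_batch if z.item() == 1.0 else generated_batch
    logits = classifier(batch)
    return F.binary_cross_entropy_with_logits(logits, labels)
```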
4. A Tale of Two Perplexities: Sensitivity of Neural Language Models to Lexical Retrieval Deficits in Dementia of the Alzheimer's Type [PDF] Back to Contents
Trevor Cohen, Serguei Pakhomov
Abstract: In recent years there has been a burgeoning interest in the use of computational methods to distinguish between elicited speech samples produced by patients with dementia, and those from healthy controls. The difference between perplexity estimates from two neural language models (LMs) - one trained on transcripts of speech produced by healthy participants and the other trained on transcripts from patients with dementia - as a single feature for diagnostic classification of unseen transcripts has been shown to produce state-of-the-art performance. However, little is known about why this approach is effective, and on account of the lack of case/control matching in the most widely-used evaluation set of transcripts (DementiaBank), it is unclear if these approaches are truly diagnostic, or are sensitive to other variables. In this paper, we interrogate neural LMs trained on participants with and without dementia using synthetic narratives previously developed to simulate progressive semantic dementia by manipulating lexical frequency. We find that perplexity of neural LMs is strongly and differentially associated with lexical frequency, and that a mixture model resulting from interpolating control and dementia LMs improves upon the current state-of-the-art for models trained on transcript text exclusively.
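Two computations carry the argument above: the single classification feature (the difference between the perplexities assigned by the control-trained and dementia-trained LMs) and the interpolated mixture of the two models. A hedged sketch of both follows; the per-token log-probability lists are assumed to come from any pair of trained LMs, not from the authors' released code.

```python
import math

def perplexity(log_probs):
    """log_probs: per-token natural-log probabilities of one transcript."""
    return math.exp(-sum(log_probs) / len(log_probs))

def paired_perplexity_feature(lp_control, lp_dementia):
    # The single diagnostic feature: perplexity under the control LM
    # minus perplexity under the dementia LM.
    return perplexity(lp_control) - perplexity(lp_dementia)

def interpolated_log_probs(lp_control, lp_dementia, lam=0.5):
    # Mixture model: interpolate the two LMs' token probabilities.
    return [math.log(lam * math.exp(a) + (1 - lam) * math.exp(b))
            for a, b in zip(lp_control, lp_dementia)]
```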
5. Learning Implicit Text Generation via Feature Matching [PDF] Back to Contents
Inkit Padhi, Pierre Dognin, Ke Bai, Cicero Nogueira dos Santos, Vijil Chenthamarakshan, Youssef Mroueh, Payel Das
Abstract: Generative feature matching network (GFMN) is an approach for training implicit generative models for images by performing moment matching on features from pre-trained neural networks. In this paper, we present new GFMN formulations that are effective for sequential data. Our experimental results show the effectiveness of the proposed method, SeqGFMN, for three distinct generation tasks in English: unconditional text generation, class-conditional text generation, and unsupervised text style transfer. SeqGFMN is stable to train and outperforms various adversarial approaches for text generation and text style transfer.
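Moment matching on pretrained features, the mechanism GFMN inherits, trains a generator so that feature statistics of generated batches match those of real data, with no adversarial discriminator. A compact sketch under stated assumptions: `feature_net` is any frozen pretrained encoder, and matching means and variances is one common choice (the original precomputes the real-data moments).

```python
import torch

def feature_matching_loss(feature_net, real, generated):
    # Features of real data are treated as fixed targets.
    with torch.no_grad():
        f_real = feature_net(real)
    f_gen = feature_net(generated)  # gradient flows to the generator
    mean_term = (f_real.mean(0) - f_gen.mean(0)).pow(2).sum()
    var_term = (f_real.var(0) - f_gen.var(0)).pow(2).sum()
    return mean_term + var_term
```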
6. MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis [PDF] Back to Contents
Devamanyu Hazarika, Roger Zimmermann, Soujanya Poria
Abstract: Multimodal Sentiment Analysis is an active area of research that leverages multimodal signals for affective understanding of user-generated videos. The predominant approach, addressing this task, has been to develop sophisticated fusion techniques. However, the heterogeneous nature of the signals creates distributional modality gaps that pose significant challenges. In this paper, we aim to learn effective modality representations to aid the process of fusion. We propose a novel framework, MISA, which projects each modality to two distinct subspaces. The first subspace is modality invariant, where the representations across modalities learn their commonalities and reduce the modality gap. The second subspace is modality-specific, which is private to each modality and captures their characteristic features. These representations provide a holistic view of the multimodal data, which is used for fusion that leads to task predictions. Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models. We also consider the task of Multimodal Humor Detection and experiment on the recently proposed UR_FUNNY dataset. Here too, our model fares better than strong baselines, establishing MISA as a useful multimodal framework.
7. The Danish Gigaword Project [PDF] Back to Contents
Leon Strømberg-Derczynski, Rebekah Baglini, Morten H. Christiansen, Manuel R. Ciosici, Jacob Aarup Dalsgaard, Riccardo Fusaroli, Peter Juel Henrichsen, Rasmus Hvingelby, Andreas Kirkedal, Alex Speed Kjeldsen, Claus Ladefoged, Finn Årup Nielsen, Malte Lau Petersen, Jonathan Hvithamar Rystrøm, Daniel Varab
Abstract: Danish is a North Germanic/Scandinavian language spoken primarily in Denmark, a country with a tradition of technological and scientific innovation. However, from a technological perspective, the Danish language has received relatively little attention and, as a result, Danish language technology is hard to develop, in part due to a lack of large or broad-coverage Danish corpora. This paper describes the Danish Gigaword project, which aims to construct a freely-available one billion word corpus of Danish text that represents the breadth of the written language.
8. Practical Perspectives on Quality Estimation for Machine Translation [PDF] Back to Contents
Junpei Zhou, Ciprian Chelba, Yuezhang
Abstract: Sentence-level quality estimation (QE) for machine translation (MT) attempts to predict the translation edit rate (TER) cost of the post-editing work required to correct MT output. We describe our view on sentence-level QE as dictated by several practical setups encountered in the industry. We find consumers of MT output, whether human or algorithmic, to be primarily interested in a binary quality metric: is the translated sentence adequate as-is, or does it need post-editing? Motivated by this, we propose a quality classification (QC) view on sentence-level QE whereby we focus on maximizing recall at precision above a given threshold. We demonstrate that, while classical QE regression models fare poorly on this task, they can be re-purposed by replacing the output regression layer with a binary classification one, achieving 50-60% recall at 90% precision. For a high-quality MT system producing 75-80% correct translations, this promises a significant reduction in post-editing work indeed.
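The proposed quality-classification view optimises a specific operating point: the highest recall achievable while precision stays above a threshold (90% in the abstract). A small sketch of that metric using scikit-learn's PR curve; the variable names are assumptions, not the authors' evaluation script.

```python
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, scores, min_precision=0.90):
    """y_true: 1 = translation adequate as-is, 0 = needs post-editing.
    scores: classifier confidence that the translation is adequate."""
    precision, recall, _ = precision_recall_curve(y_true, scores)
    feasible = [r for p, r in zip(precision, recall) if p >= min_precision]
    return max(feasible) if feasible else 0.0
```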
9. Reference and Document Aware Semantic Evaluation Methods for Korean Language Summarization [PDF] Back to Contents
Dongyub Lee, Myeongcheol Shin, Taesun Whang, Seungwoo Cho, Byeongil Ko, Daniel Lee, Eunggyun Kim, Jaechoon Jo
Abstract: Text summarization refers to the process of generating a shorter form of text from a source document while preserving salient information. Recently, many models for text summarization have been proposed. Most of those models were evaluated using Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores. However, as ROUGE scores are computed based on n-gram overlap, they do not reflect semantic meaning correspondences between generated and reference summaries. Because Korean is an agglutinative language that combines various morphemes into words that express several meanings, ROUGE is not suitable for Korean summarization. In this paper, we propose evaluation metrics that reflect semantic meanings of a reference summary and the original document: Reference and Document Aware Semantic Score (RDASS). We then propose a method for improving the correlation of the metrics with human judgment. Evaluation results show that the correlation with human judgment is significantly higher for our evaluation metrics than for ROUGE scores.
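In the spirit of the RDASS metric described above, a score can reward a generated summary for being semantically close to both the reference summary and the source document rather than for n-gram overlap. The sketch below averages two cosine similarities; the embedding vectors are assumed to come from any sentence encoder, and the paper's exact formulation may differ.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rdass_like_score(pred_vec, ref_vec, doc_vec):
    # Reward faithfulness to the reference and to the document equally.
    return (cosine(pred_vec, ref_vec) + cosine(pred_vec, doc_vec)) / 2
```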
10. Fine-Grained Analysis of Cross-Linguistic Syntactic Divergences [PDF] Back to Contents
Dmitry Nikolaev, Ofir Arviv, Taelin Karidi, Neta Kenneth, Veronika Mitnik, Lilja Maria Saeboe, Omri Abend
Abstract: The patterns in which the syntax of different languages converges and diverges are often used to inform work on cross-lingual transfer. Nevertheless, little empirical work has been done on quantifying the prevalence of different syntactic divergences across language pairs. We propose a framework for extracting divergence patterns for any language pair from a parallel corpus, building on Universal Dependencies. We show that our framework provides a detailed picture of cross-language divergences, generalizes previous approaches, and lends itself to full automation. We further present a novel dataset, a manually word-aligned subset of the Parallel UD corpus in five languages, and use it to perform a detailed corpus study. We demonstrate the usefulness of the resulting analysis by showing that it can help account for performance patterns of a cross-lingual parser.
11. The Perceptimatic English Benchmark for Speech Perception Models [PDF] Back to Contents
Juliette Millet, Ewan Dunbar
Abstract: We present the Perceptimatic English Benchmark, an open experimental benchmark for evaluating quantitative models of speech perception in English. The benchmark consists of ABX stimuli along with the responses of 91 American English-speaking listeners. The stimuli test discrimination of a large number of English and French phonemic contrasts. They are extracted directly from corpora of read speech, making them appropriate for evaluating statistical acoustic models (such as those used in automatic speech recognition) trained on typical speech data sets. We show that phone discrimination is correlated with several types of models, and give recommendations for researchers seeking easily calculated norms of acoustic distance on experimental stimuli. We show that DeepSpeech, a standard English speech recognizer, is more specialized on English phoneme discrimination than English listeners, and is poorly correlated with their behaviour, even though it yields a low error on the decision task given to humans.
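The ABX paradigm behind the benchmark reduces to a simple decision rule: given A and X drawn from one phonemic category and B from another, a model is correct when its representation of X is closer to A than to B. A minimal sketch follows; Euclidean distance is an illustrative choice (frame-level models typically use DTW-aggregated distances).

```python
import numpy as np

def abx_correct(a_vec, b_vec, x_vec):
    # X shares a category with A; the model should place X nearer to A.
    return np.linalg.norm(a_vec - x_vec) < np.linalg.norm(b_vec - x_vec)

def abx_accuracy(triples):
    """triples: iterable of (A, B, X) representation vectors."""
    results = [abx_correct(a, b, x) for a, b, x in triples]
    return sum(results) / len(results)
```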
12. Does Multi-Encoder Help? A Case Study on Context-Aware Neural Machine Translation [PDF] Back to Contents
Bei Li, Hui Liu, Ziyang Wang, Yufan Jiang, Tong Xiao, Jingbo Zhu, Tongran Liu, Changliang Li
Abstract: In encoder-decoder neural models, multiple encoders are generally used to represent contextual information in addition to the individual sentence. In this paper, we investigate multi-encoder approaches in document-level neural machine translation (NMT). Surprisingly, we find that the context encoder does not only encode the surrounding sentences but also behaves as a noise generator. This makes us rethink the real benefits of multi-encoder in context-aware translation: some of the improvements come from robust training. We compare several methods that introduce noise and/or well-tuned dropout setups into the training of these encoders. Experimental results show that noisy training plays an important role in multi-encoder-based NMT, especially when the training data is small. Also, we establish a new state-of-the-art on the IWSLT Fr-En task by careful use of noise generation and dropout methods.
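The "noisy training" finding suggests a very small intervention: perturb the context encoder's output during training so the decoder cannot over-rely on it. A rough sketch of one such scheme; the Gaussian form and scale are assumptions, since the paper compares several noise and dropout variants.

```python
import torch

def noisy_context(context_states, sigma=0.1, training=True):
    # Add zero-mean Gaussian noise to the context representation at
    # training time only; pass it through unchanged at inference.
    if training:
        return context_states + sigma * torch.randn_like(context_states)
    return context_states
```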
13. 2kenize: Tying Subword Sequences for Chinese Script Conversion [PDF] Back to Contents
Pranav A, Isabelle Augenstein
Abstract: Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches have poor performance because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, as well as a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities.
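The core difficulty the abstract names, one simplified character mapping to several traditional ones, can be made concrete as candidate enumeration plus language-model scoring over the target script. The mapping table and `score` function below are toy assumptions for illustration, not 2kenize's actual tables or models.

```python
from itertools import product

# A tiny one-to-many Simplified -> Traditional table (illustrative only).
S2T = {"发": ["發", "髮"], "面": ["面", "麵"]}

def candidates(simplified):
    options = [S2T.get(ch, [ch]) for ch in simplified]
    return ["".join(chars) for chars in product(*options)]

def convert(simplified, score):
    # `score` stands in for a traditional-script LM log-likelihood.
    return max(candidates(simplified), key=score)

# e.g. candidates("发面") -> ["發面", "發麵", "髮面", "髮麵"]
```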
14. JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation [PDF] Back to Contents
Zhuoyuan Mao, Fabien Cromieres, Raj Dabre, Haiyue Song, Sadao Kurohashi
Abstract: Neural machine translation (NMT) needs large parallel corpora for state-of-the-art translation quality. Low-resource NMT is typically addressed by transfer learning, which leverages large monolingual or parallel corpora for pre-training. Monolingual pre-training approaches such as MASS (MAsked Sequence to Sequence) are extremely effective in boosting NMT quality for languages with small parallel corpora. However, they do not account for linguistic information obtained using syntactic analyzers, which is known to be invaluable for several Natural Language Processing (NLP) tasks. To this end, we propose JASS, Japanese-specific Sequence to Sequence, as a novel pre-training alternative to MASS for NMT involving Japanese as the source or target language. JASS is joint BMASS (Bunsetsu MASS) and BRSS (Bunsetsu Reordering Sequence to Sequence) pre-training, which focuses on Japanese linguistic units called bunsetsus. In our experiments on ASPEC Japanese-English and News Commentary Japanese-Russian translation, we show that JASS can give results that are competitive with, if not better than, those given by MASS. Furthermore, we show for the first time that joint MASS and JASS pre-training gives results that significantly surpass the individual methods, indicating their complementary nature. We will release our code, pre-trained models and bunsetsu-annotated data as resources for researchers to use in their own NLP tasks.
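MASS, the baseline JASS extends, masks one contiguous span on the encoder side and trains the decoder to reconstruct exactly that span; the BMASS variant picks spans in terms of bunsetsu units instead. The helper below is a generic, hypothetical illustration of the span-masking step, not the authors' pipeline.

```python
import random

MASK = "[MASK]"

def mass_example(tokens, span_ratio=0.5):
    # Mask a contiguous span covering roughly span_ratio of the sentence;
    # the decoder's target is the masked span itself.
    n = max(1, int(len(tokens) * span_ratio))
    start = random.randrange(len(tokens) - n + 1)
    source = tokens[:start] + [MASK] * n + tokens[start + n:]
    target = tokens[start:start + n]
    return source, target

src, tgt = mass_example("自然 言語 処理 は 楽しい".split())
```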
15. DramaQA: Character-Centered Video Story Understanding with Hierarchical QA [PDF] Back to Contents
Seongho Choi, Kyoung-Woon On, Yu-Jung Heo, Ahjeong Seo, Youwon Jang, Seungchan Lee, Minsu Lee, Byoung-Tak Zhang
Abstract: Despite recent progress on computer vision and natural language processing, developing video understanding intelligence is still hard to achieve due to the intrinsic difficulty of understanding stories in video. Moreover, there is no theoretical metric for evaluating the degree of video understanding. In this paper, we propose a novel video question answering (Video QA) task, DramaQA, for a comprehensive understanding of the video story. DramaQA focuses on two perspectives: 1) hierarchical QAs as an evaluation metric based on the cognitive developmental stages of human intelligence; 2) character-centered video annotations to model local coherence of the story. Our dataset is built upon the TV drama "Another Miss Oh" and it contains 16,191 QA pairs from 23,928 various-length video clips, with each QA pair belonging to one of four difficulty levels. We provide 217,308 annotated images with rich character-centered annotations, including visual bounding boxes, behaviors, and emotions of main characters, and coreference-resolved scripts. Additionally, we provide analyses of the dataset as well as a Dual Matching Multistream model, which effectively learns character-centered representations of video to answer questions about the video. We are planning to release our dataset and model publicly for research purposes and expect that our work will provide a new perspective on video story understanding research.
16. Nakdan: Professional Hebrew Diacritizer [PDF] Back to Contents
Avi Shmidman, Shaltiel Shmidman, Moshe Koppel, Yoav Goldberg
Abstract: We present a system for automatic diacritization of Hebrew text. The system combines modern neural models with carefully curated declarative linguistic knowledge and comprehensive manually constructed tables and dictionaries. Besides providing state-of-the-art diacritization accuracy, the system also supports an interface for manual editing and correction of the automatic output, and has several features which make it particularly useful for the preparation of scientific editions of Hebrew texts. The system supports Modern Hebrew, Rabbinic Hebrew and Poetic Hebrew. The system is freely accessible for all use at this http URL.
17. Quda: Natural Language Queries for Visual Data Analytics [PDF] Back to Contents
Siwei Fu, Kai Xiong, Xiaodong Ge, Yingcai Wu, Siliang Tang, Wei Chen
Abstract: Visualization-oriented natural language interfaces (V-NLIs) have been explored and developed in recent years. One challenge faced by V-NLIs is making effective design decisions, which usually requires a deep understanding of user queries. Learning-based approaches have shown potential in V-NLIs and reached state-of-the-art performance in various NLP tasks. However, because of the lack of sufficient training samples that cater to visual data analytics, cutting-edge techniques have rarely been employed to facilitate the development of V-NLIs. We present a new dataset, called Quda, to help V-NLIs understand free-form natural language. Our dataset contains 14,035 diverse user queries annotated with 10 low-level analytic tasks that assist in the deployment of state-of-the-art techniques for parsing complex human language. We achieve this goal by first gathering seed queries with data analysts who are target users of V-NLIs. Then we employ extensive crowd force for paraphrase generation and validation. We demonstrate the usefulness of Quda in building V-NLIs by creating a prototype that makes effective design decisions for free-form user queries. We also show that Quda can be beneficial for a wide range of applications in the visualization community by analyzing the design tasks described in academic publications.
18. Fact-based Dialogue Generation with Convergent and Divergent Decoding [PDF] Back to Contents
Ryota Tanaka, Akinobu Lee
Abstract: Fact-based dialogue generation is the task of generating a human-like response based on both dialogue context and factual texts. Various methods have been proposed that focus on generating informative words that effectively convey facts. However, previous works implicitly assume that the topic of a dialogue stays fixed and usually converse passively; such systems therefore have difficulty generating diverse responses that proactively provide meaningful information. This paper proposes an end-to-end fact-based dialogue system augmented with the ability of convergent and divergent thinking over both context and facts, which can converse about the current topic or introduce a new topic. Specifically, our model incorporates a novel convergent and divergent decoding that can generate informative and diverse responses considering not only given inputs (context and facts) but also input-related topics. Both automatic and human evaluation results on the DSTC7 dataset show that our model significantly outperforms state-of-the-art baselines, indicating that our model can generate more appropriate, informative, and diverse responses.
19. Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting [PDF] Back to Contents
Po-Yao Huang, Junjie Hu, Xiaojun Chang, Alexander Hauptmann
Abstract: Unsupervised machine translation (MT) has recently achieved impressive results with monolingual corpora only. However, it is still challenging to associate source-target sentences in the latent space. As people who speak different languages biologically share similar visual systems, the potential of achieving better alignment through visual content is promising yet under-explored in unsupervised multimodal MT (MMT). In this paper, we investigate how to utilize visual content for disambiguation and for promoting latent space alignment in unsupervised MMT. Our model employs multimodal back-translation and features pseudo visual pivoting, in which we learn a shared multilingual visual-semantic embedding space and incorporate visually-pivoted captioning as additional weak supervision. The experimental results on the widely used Multi30K dataset show that the proposed model significantly improves over the state-of-the-art methods and generalizes well when the images are not available at testing time.
20. Diagnosing the Environment Bias in Vision-and-Language Navigation [PDF] Back to Contents
Yubo Zhang, Hao Tan, Mohit Bansal
Abstract: Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations. These step-by-step navigational instructions are crucial when the agent is navigating new environments about which it has no prior knowledge. Most recent works that study VLN observe a significant performance drop when tested on unseen environments (i.e., environments not used in training), indicating that the neural agent models are highly biased towards training environments. Although this issue is considered as one of the major challenges in VLN research, it is still under-studied and needs a clearer explanation. In this work, we design novel diagnosis experiments via environment re-splitting and feature replacement, looking into possible reasons for this environment bias. We observe that neither the language nor the underlying navigational graph, but the low-level visual appearance conveyed by ResNet features directly affects the agent model and contributes to this environment bias in results. According to this observation, we explore several kinds of semantic representations that contain less low-level visual information, hence the agent learned with these features could be better generalized to unseen testing environments. Without modifying the baseline agent model and its training method, our explored semantic features significantly decrease the performance gaps between seen and unseen on multiple datasets (i.e. R2R, R4R, and CVDN) and achieve competitive unseen results to previous state-of-the-art models. Our code and features are available at: this https URL
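The feature-replacement diagnosis amounts to training the same agent on different visual feature sources. Below is a toy sketch; the agent architecture, dimensions, and feature choices are hypothetical, serving only to show how low-level ResNet features could be swapped for a compact semantic representation.

```python
import torch
import torch.nn as nn

class FeatureSwappableAgent(nn.Module):
    """Toy VLN agent whose visual input can be swapped between feature types.

    Hypothetical sketch of the feature-replacement diagnosis: the same agent
    is trained and evaluated with either low-level ResNet features or
    higher-level semantic features (e.g. segmentation class histograms).
    """
    def __init__(self, feat_dim, instr_dim=256, hidden=512, n_actions=6):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.rnn = nn.GRUCell(hidden + instr_dim, hidden)
        self.policy = nn.Linear(hidden, n_actions)

    def forward(self, visual_feat, instr_ctx, h):
        x = torch.cat([torch.relu(self.proj(visual_feat)), instr_ctx], dim=-1)
        h = self.rnn(x, h)                  # recurrent navigation state
        return self.policy(h), h            # action logits for this step

# Diagnosis: instantiate the identical agent with different feature sources.
resnet_agent   = FeatureSwappableAgent(feat_dim=2048)  # low-level appearance
semantic_agent = FeatureSwappableAgent(feat_dim=42)    # e.g. 42 semantic classes
```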
21. Categorical Vector Space Semantics for Lambek Calculus with a Relevant Modality [PDF]
Lachlan McPheat, Mehrnoosh Sadrzadeh, Hadi Wazni, Gijs Wijnholds
Abstract: We develop a categorical compositional distributional semantics for Lambek Calculus with a Relevant Modality !L*, which has a restricted version of the contraction and permutation rules. The categorical part of the semantics is a monoidal biclosed category with a coalgebra modality, very similar to the structure of a Differential Category. We instantiate this category to finite-dimensional vector spaces and linear maps via "quantisation" functors, and work with three concrete interpretations of the coalgebra modality. We apply the model to construct categorical and concrete semantic interpretations for the motivating example of !L*: the derivation of a phrase with a parasitic gap. The effectiveness of the concrete interpretations is evaluated via a disambiguation task, on an extension of a sentence disambiguation dataset to parasitic-gap phrases, using BERT, Word2Vec, and FastText vectors and relational tensors.
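One standard way to realize a coalgebra modality on finite-dimensional vector spaces, plausibly related to but not necessarily identical with the paper's three concrete interpretations, is the basis-copying comultiplication; a minimal sketch in our own notation:

```latex
% Basis-copying comultiplication \delta : !V \to !V \otimes !V,
% acting on v = \sum_i v_i e_i (an illustrative choice of coalgebra
% structure, not necessarily one of the paper's three interpretations):
\delta\Big(\sum_i v_i\, e_i\Big) \;=\; \sum_i v_i\,(e_i \otimes e_i)
```

A map of this kind supplies the controlled copying that the contraction rule of !L* requires, e.g. to let the single gap filler in a parasitic-gap phrase fill two argument positions.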
22. Weakly-Supervised Neural Response Selection from an Ensemble of Task-Specialised Dialogue Agents [PDF]
Asir Saeed, Khai Mai, Pham Minh, Nguyen Tuan Duc, Danushka Bollegala
Abstract: Dialogue engines that incorporate different types of agents to converse with humans are popular. However, conversations are dynamic in the sense that a selected response will change the conversation on-the-fly, influencing the subsequent utterances in the conversation, which makes response selection a challenging problem. We model the problem of selecting the best response from a set of responses generated by a heterogeneous set of dialogue agents, taking into account the conversational history, and propose a Neural Response Selection method. The proposed method is trained to predict a coherent set of responses within a single conversation, considering its own predictions, via a curriculum training mechanism. Our experimental results show that the proposed method can accurately select the most appropriate responses, thereby significantly improving the user experience in dialogue systems.
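To make the selection step concrete, here is a minimal, hypothetical scorer that ranks candidate responses from the agent ensemble against an embedding of the conversation history; the bilinear form and the dimensions are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ResponseSelector(nn.Module):
    """Scores candidate responses from an ensemble of dialogue agents
    against the conversation history (hypothetical bilinear scorer)."""
    def __init__(self, dim):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def forward(self, history_emb, candidate_embs):
        # history_emb: (dim,); candidate_embs: (n_candidates, dim)
        h = history_emb.unsqueeze(0).expand_as(candidate_embs).contiguous()
        return self.bilinear(h, candidate_embs).squeeze(-1)

selector = ResponseSelector(dim=128)
scores = selector(torch.randn(128), torch.randn(5, 128))
best = scores.argmax().item()   # index of the selected response
```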
23. Extracting Headless MWEs from Dependency Parse Trees: Parsing, Tagging, and Joint Modeling Approaches [PDF]
Tianze Shi, Lillian Lee
Abstract: An interesting and frequent type of multi-word expression (MWE) is the headless MWE, for which there are no true internal syntactic dominance relations; examples include many named entities ("Wells Fargo") and dates ("July 5, 2020") as well as certain productive constructions ("blow for blow", "day after day"). Despite their special status and prevalence, current dependency-annotation schemes require treating such flat structures as if they had internal syntactic heads, and most current parsers handle them in the same fashion as headed constructions. Meanwhile, outside the context of parsing, taggers are typically used for identifying MWEs, but taggers might benefit from structural information. We empirically compare these two common strategies--parsing and tagging--for predicting flat MWEs. Additionally, we propose an efficient joint decoding algorithm that combines scores from both strategies. Experimental results on the MWE-Aware English Dependency Corpus and on six non-English dependency treebanks with frequent flat structures show that: (1) tagging is more accurate than parsing for identifying flat-structure MWEs, (2) our joint decoder reconciles the two different views and, for non-BERT features, leads to higher accuracies, and (3) most of the gains result from feature sharing between the parsers and taggers.
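The idea of fusing parser and tagger evidence can be sketched as a simple per-span score interpolation; the paper's joint decoder is more sophisticated, so treat the helper below (names and the weight lam are hypothetical) as conveying only the score-combination idea.

```python
def joint_mwe_score(parser_scores, tagger_scores, lam=0.5):
    """Combine per-span scores from a parser and a tagger for flat-MWE
    prediction. Hypothetical linear interpolation, not the paper's decoder.

    parser_scores, tagger_scores: dicts mapping (start, end) spans to scores.
    """
    spans = set(parser_scores) | set(tagger_scores)
    return {s: lam * parser_scores.get(s, 0.0)
               + (1 - lam) * tagger_scores.get(s, 0.0)
            for s in spans}

# Example: both models see "Wells Fargo"; only the tagger proposes a date span.
combined = joint_mwe_score({(0, 2): 1.3}, {(0, 2): 0.9, (3, 6): 0.4})
```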
24. Evaluating text coherence based on the graph of the consistency of phrases to identify symptoms of schizophrenia [PDF]
Artem Kramov
Abstract: Different state-of-the-art methods for detecting symptoms of schizophrenia based on estimating text coherence have been analyzed. We suggest analyzing a text at the level of phrases and propose a method based on a graph of the consistency of phrases to evaluate the semantic coherence and the cohesion of a text. Semantic coherence, cohesion, and other linguistic features (lexical diversity, lexical density) are taken into account to form feature vectors for training a classifier. The classifier was trained on a set of English-language interviews. Based on the retrieved results, the impact of each feature on the output of the model has been analyzed. The results indicate that the proposed method based on the graph of the consistency of phrases may be used in different tasks for the detection of mental illness.
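A minimal sketch of a phrase-consistency graph, assuming phrases have already been embedded as vectors: treat pairwise cosine similarities as edge weights and read off a coherence score. The paper's actual graph construction and feature set are richer than this.

```python
import numpy as np

def coherence_from_phrase_graph(phrase_vecs):
    """Mean pairwise cosine similarity over a fully connected graph of
    phrase embeddings, used here as a toy coherence score (illustrative
    simplification with assumed, pre-computed inputs)."""
    X = np.asarray(phrase_vecs, dtype=float)
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T                           # edge weights of the phrase graph
    off_diag = sim[~np.eye(len(X), dtype=bool)]
    return off_diag.mean()                  # higher = more coherent text
```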
25. Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture [PDF]
Christopher Brix, Parnia Bahar, Hermann Ney
Abstract: Sparse models require less memory for storage and enable faster inference by reducing the necessary number of FLOPs. This is relevant both for time-critical and on-device computations using neural networks. The stabilized lottery ticket hypothesis states that networks can be pruned after none or few training iterations, using a mask computed based on the unpruned converged model. On the Transformer architecture and the WMT 2014 English-to-German and English-to-French tasks, we show that stabilized lottery ticket pruning performs similarly to magnitude pruning for sparsity levels of up to 85%, and propose a new combination of pruning techniques that outperforms all other techniques for even higher levels of sparsity. Furthermore, we confirm that a parameter's initial sign, and not its specific value, is the primary factor for successful training, and show that magnitude pruning cannot be used to find winning lottery tickets.
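A compact sketch of the stabilized lottery ticket procedure described here: derive a magnitude mask from the converged model and apply it to weights rewound to an early checkpoint. The tensor-dict layout and threshold handling below are simplified assumptions.

```python
import torch

def stabilized_lottery_mask(converged, rewind, sparsity=0.85):
    """Stabilized lottery ticket pruning sketch.

    converged, rewind: dicts of parameter tensors (same keys/shapes);
    `converged` is the fully trained model, `rewind` an early checkpoint.
    Returns the pruned rewound weights and the binary mask.
    """
    flat = torch.cat([w.abs().flatten() for w in converged.values()])
    k = max(1, int(sparsity * flat.numel()))
    threshold = flat.kthvalue(k).values      # global magnitude threshold
    mask = {name: (w.abs() > threshold).float() for name, w in converged.items()}
    pruned = {name: rewind[name] * mask[name] for name in rewind}
    return pruned, mask
```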
26. Adaptive Dialog Policy Learning with Hindsight and User Modeling [PDF]
Yan Cao, Keting Lu, Xiaoping Chen, Shiqi Zhang
Abstract: Reinforcement learning methods have been used to compute dialog policies from language-based interaction experiences. Efficiency is of particular importance in dialog policy learning because of the considerable cost of interacting with people and the very poor user experience resulting from low-quality conversations. Aiming at improving the efficiency of dialog policy learning, we develop the algorithm LHUA (Learning with Hindsight, User modeling, and Adaptation), which, for the first time, enables dialog agents to adaptively learn with hindsight from both simulated and real users. Simulation and hindsight provide the dialog agent with more experience and more (positive) reinforcements, respectively. Experimental results suggest that, in success rate and policy quality, LHUA outperforms competitive baselines from the literature, including its no-simulation, no-adaptation, and no-hindsight counterparts.
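The hindsight component can be illustrated with goal relabeling: a failed dialog is replayed as a success for the goal it actually achieved. The data layout below is hypothetical and only conveys the hindsight idea that LHUA builds on, not its actual algorithm.

```python
def relabel_with_hindsight(dialog, achieved_goal):
    """Hindsight relabeling sketch for dialog RL.

    dialog: list of (state, action, reward) transitions, where each state
    is a dict containing a "goal" entry (hypothetical layout). The dialog
    is rewritten as if `achieved_goal` had been the target all along.
    """
    relabeled = []
    for i, (state, action, _) in enumerate(dialog):
        reward = 1.0 if i == len(dialog) - 1 else 0.0  # success for achieved goal
        relabeled.append(({**state, "goal": achieved_goal}, action, reward))
    return relabeled
```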
27. YANG2UML: Bijective Transformation and Simplification of YANG to UML [PDF]
Mario Golling, Robert Koch, Peter Hillmann, Rick Hofstede, Frank Tietze
Abstract: Software Defined Networking is currently revolutionizing computer networking by decoupling the network control (control plane) from the forwarding functions (data plane), enabling the network control to become directly programmable and the underlying infrastructure to be abstracted for applications and network services. Next to the well-known OpenFlow protocol, the XML-based NETCONF protocol is also an important means for exchanging configuration information with a management platform and is nowadays even part of OpenFlow. In combination with NETCONF, YANG is the corresponding data modeling language that defines the associated data structures supporting virtually all network configuration protocols. YANG itself is a semantically rich language, which -- in order to facilitate familiarization with the relevant subject -- is often visualized to involve other experts or developers and to support them in their daily work (writing applications which make use of YANG). To support this process, this paper presents a novel approach to optimize and simplify YANG data models, assisting further discussions with management and reducing the complexity of implementations (especially of interfaces). To this end, we have defined a bidirectional mapping of YANG to UML and developed a tool that renders the created UML diagrams. This combines the benefits of using the formal language YANG with automatically maintained UML diagrams to involve other experts or developers, closing the gap between technically improved data models and their human readability.
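A toy example of the YANG-to-UML direction, assuming the YANG model has already been parsed (e.g. with a tool such as pyang, which is not shown): render a container's leaves as attributes of a PlantUML class. The paper's bijective mapping and simplification rules go well beyond this sketch.

```python
def yang_container_to_plantuml(name, leaves):
    """Render a YANG container as a PlantUML class (illustrative only).

    leaves: mapping of leaf name -> YANG type, e.g. {"mtu": "uint16"}.
    """
    lines = [f"class {name} {{"]
    lines += [f"  +{typ} {leaf}" for leaf, typ in leaves.items()]
    lines.append("}")
    return "\n".join(lines)

print(yang_container_to_plantuml("interface", {"name": "string", "mtu": "uint16"}))
```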
28. RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions [PDF]
Chung-Cheng Chiu, Arun Narayanan, Wei Han, Rohit Prabhavalkar, Yu Zhang, Navdeep Jaitly, Ruoming Pang, Tara N. Sainath, Patrick Nguyen, Liangliang Cao, Yonghui Wu
Abstract: In recent years, all-neural end-to-end approaches have obtained state-of-the-art results on several challenging automatic speech recognition (ASR) tasks. However, most existing works focus on building ASR models where train and test data are drawn from the same domain. This results in poor generalization on mismatched domains: e.g., end-to-end models trained on short segments perform poorly when evaluated on longer utterances. In this work, we analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models in order to identify model components that negatively affect generalization performance. We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference. On a long-form YouTube test set, when the non-streaming RNN-T model is trained with shorter segments of data, the proposed combination improves word error rate (WER) from 22.3% to 14.8%; when the streaming RNN-T model is trained on short Search queries, the proposed techniques improve WER on the YouTube set from 67.0% to 25.3%. Finally, when trained on Librispeech, we find that dynamic overlapping inference improves WER on YouTube from 99.8% to 33.0%.
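Dynamic overlapping inference can be sketched as decoding a long utterance in overlapping windows and stitching the hypotheses. The window sizes, sample rate, and the naive merge below are assumptions; the paper aligns and deduplicates the overlapped words rather than simply concatenating.

```python
def overlapping_inference(audio, decode, win=30.0, overlap=5.0, sr=16000):
    """Decode long-form audio in overlapping windows (simplified sketch).

    audio: 1-D array/list of samples at rate `sr`; decode: function mapping
    an audio segment to a text hypothesis. Hypothetical parameters.
    """
    step = win - overlap
    hyps, t = [], 0.0
    duration = len(audio) / sr
    while t < duration:
        seg = audio[int(t * sr): int((t + win) * sr)]
        hyps.append(decode(seg))
        t += step
    # Naive merge: concatenate; the actual method matches and deduplicates
    # the words shared by the overlapped regions of consecutive windows.
    return " ".join(hyps)
```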
29. ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context [PDF]
Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu
Abstract: Convolutional neural networks (CNNs) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond it with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet and achieves a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without an external language model (LM), 1.9%/4.1% with an LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system at 2.0%/4.6% with an LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.
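The squeeze-and-excitation module that ContextNet adds to its convolution blocks can be sketched generically as global average pooling over time followed by a channel-gating bottleneck; this is a standard SE block consistent with the description above, not ContextNet's exact implementation.

```python
import torch
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """Squeeze-and-excitation over a 1-D feature map (batch, channels, time):
    global average pooling over time, a bottleneck MLP, and channel-wise
    re-scaling. Generic SE block, hypothetical hyperparameters."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (batch, channels, time)
        context = x.mean(dim=2)           # squeeze: global context per channel
        weights = self.fc(context)        # excite: per-channel gates in (0, 1)
        return x * weights.unsqueeze(-1)  # broadcast gates over time

out = SqueezeExcite1d(256)(torch.randn(4, 256, 100))
```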