
[arXiv Papers] Computation and Language 2020-07-08

Contents

1. What Gives the Answer Away? Question Answering Bias Analysis on Video QA Datasets [PDF] Abstract
2. An Emergency Medical Services Clinical Audit System driven by Named Entity Recognition from Deep Learning [PDF] Abstract
3. scb-mt-en-th-2020: A Large English-Thai Parallel Corpus [PDF] Abstract
4. The Go Transformer: Natural Language Modeling for Game Play [PDF] Abstract
5. Continual BERT: Continual Learning for Adaptive Extractive Summarization of COVID-19 Literature [PDF] Abstract
6. Research on Annotation Rules and Recognition Algorithm Based on Phrase Window [PDF] Abstract
7. Announcing CzEng 2.0 Parallel Corpus with over 2 Gigawords [PDF] Abstract
8. Cultural Convergence: Insights into the behavior of misinformation networks on Twitter [PDF] Abstract
9. Do Transformers Need Deep Long-Range Memory? [PDF] Abstract
10. Structured (De)composable Representations Trained with Neural Networks [PDF] Abstract
11. DAM: Deliberation, Abandon and Memory Networks for Generating Detailed and Non-repetitive Responses in Visual Dialogue [PDF] Abstract
12. Labeling of Multilingual Breast MRI Reports [PDF] Abstract
13. Deep Contextual Embeddings for Address Classification in E-commerce [PDF] Abstract
14. Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters [PDF] Abstract

Abstracts

1. What Gives the Answer Away? Question Answering Bias Analysis on Video QA Datasets [PDF] Back to Contents
  Jianing Yang, Yuying Zhu, Yongxin Wang, Ruitao Yi, Amir Zadeh, Louis-Philippe Morency
Abstract: Question answering biases in video QA datasets can mislead multimodal models into overfitting to QA artifacts, jeopardizing their ability to generalize. Understanding how strong these QA biases are and where they come from helps the community measure progress more accurately and provides researchers with insights to debug their models. In this paper, we analyze QA biases in popular video question answering datasets and discover that pretrained language models can answer 37-48% of questions correctly without using any multimodal context, far exceeding the 20% random-guess baseline for 5-choose-1 multiple-choice questions. Our ablation study shows that biases can come from both the annotators and the types of questions. Specifically, questions from annotators seen during training are predicted more easily by the model, and reasoning or abstract questions incur more bias than factual, direct questions. We also show empirically that using annotator-non-overlapping train-test splits can reduce QA biases in video QA datasets.
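
The text-only probe at the heart of this analysis is easy to reproduce in spirit: score each of the five candidate answers with a pretrained language model while ignoring all video context, and pick the most likely one. The sketch below is an illustration under assumed details (GPT-2 as the language model, a simple "Q: ... A: ..." prompt), not the authors' exact setup.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def answer_without_video(question, candidates):
    # Return the index of the candidate with the lowest LM loss,
    # i.e. the answer the language model alone finds most plausible.
    losses = []
    for cand in candidates:
        ids = tokenizer(f"Q: {question} A: {cand}", return_tensors="pt").input_ids
        with torch.no_grad():
            losses.append(model(ids, labels=ids).loss.item())
    return min(range(len(losses)), key=losses.__getitem__)

# Accuracy well above the 20% chance level on a 5-way dataset would
# indicate the kind of QA bias the paper measures.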

2. An Emergency Medical Services Clinical Audit System driven by Named Entity Recognition from Deep Learning [PDF] Back to Contents
  Wang Han, Wesley Yeung, Angeline Tung, Joey Tay Ai Meng, Davin Ryanputera, Feng Mengling, Shalini Arulanadam
Abstract: Clinical performance audits are routinely performed in Emergency Medical Services (EMS) to ensure adherence to treatment protocols, to identify individual areas of weakness for remediation, and to discover systemic deficiencies that guide the development of the training syllabus. At present, these audits are performed by manual chart review, which is time-consuming and laborious. In this paper, we present an automatic audit system based on both structured and unstructured ambulance case records and clinical notes, with a deep neural-network-based named entity recognition model. The dataset used in this study contained 58,898 unlabelled ambulance incidents encountered by the Singapore Civil Defence Force from 1st April 2019 to 30th June 2019. A weakly-supervised training approach was adopted to label the sentences. We then trained three different models to perform the NER task. All three models achieve F1 scores of around 0.981 under entity-type matching evaluation and around 0.976 under strict evaluation, while the BiLSTM-CRF model is 1-2 orders of magnitude lighter and faster than our BERT-based models. Overall, our approach yielded a named entity recognition model that can reliably identify clinical entities from unstructured paramedic free-text reports. Our proposed system may improve the efficiency of clinical performance audits and can also help with EMS database research.
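
For reference, a BiLSTM-CRF tagger of the kind compared against the BERT-based models can be written compactly in PyTorch. The sketch below is a generic illustration, assuming the third-party pytorch-crf package; layer sizes and other details are not the authors' configuration.

import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(hidden, num_tags)  # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, tokens, tags, mask):
        emissions, _ = self.lstm(self.emb(tokens))
        return -self.crf(self.proj(emissions), tags, mask=mask)  # negative log-likelihood

    def decode(self, tokens, mask):
        emissions, _ = self.lstm(self.emb(tokens))
        return self.crf.decode(self.proj(emissions), mask=mask)  # best tag sequence per sentence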

3. scb-mt-en-th-2020: A Large English-Thai Parallel Corpus [PDF] Back to Contents
  Lalita Lowphansirikul, Charin Polpanumas, Attapol T. Rutherford, Sarana Nutanong
Abstract: The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. We construct an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data, and government documents. The methodology for gathering data, building parallel texts, and removing noisy sentence pairs is presented in a reproducible manner. We train machine translation models based on this dataset. Our models' performance is comparable to that of the Google Translation API (as of May 2020) for Thai-English, and outperforms Google when the Open Parallel Corpus (OPUS) is included in the training data, for both Thai-English and English-Thai translation. The dataset, pre-trained models, and source code to reproduce our work are available for public use.
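
The noisy-pair removal step can be illustrated with a simple rule-based filter; the heuristics and thresholds below are assumptions for illustration, not the released scb-mt-en-th-2020 pipeline.

def keep_pair(src, tgt, max_ratio=3.0, max_tokens=200):
    # Drop empty, identical, over-long, or badly length-mismatched pairs.
    s, t = src.strip(), tgt.strip()
    if not s or not t or s == t:
        return False
    if len(s.split()) > max_tokens or len(t.split()) > max_tokens:
        return False
    ratio = max(len(s), len(t)) / min(len(s), len(t))
    return ratio <= max_ratio

pairs = [("Hello, world.", "สวัสดีชาวโลก"), ("", "ว่างเปล่า")]
clean = [p for p in pairs if keep_pair(*p)]  # keeps only the first pair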

4. The Go Transformer: Natural Language Modeling for Game Play [PDF] Back to Contents
  David Noever, Matthew Ciolino, Josh Kalin
Abstract: This work applies natural language modeling to generate plausible strategic moves in the ancient game of Go. We train the Generative Pretrained Transformer (GPT-2) to mimic the style of Go champions as archived in Smart Game Format (SGF), which offers a text description of move sequences. The trained model further generates valid but previously unseen strategies for Go. Because GPT-2 preserves punctuation and spacing, the raw output of the text generator provides inputs to game visualization and creative patterns, such as the Sabaki project's (2020) game engine using auto-replays. Results demonstrate that language modeling can capture both the sequencing format of championship Go games and their strategic formations. Compared to random game boards, the GPT-2 fine-tuning shows efficient opening move sequences favoring corner play over less advantageous center and side play. Game generation as a language modeling task offers novel approaches to more than 40 other board games where historical text annotation provides training data (e.g., Amazons & Connect 4/6).
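
Since SGF records are plain text, the fine-tuning itself can be set up like any causal language-modelling run. A minimal sketch using the Hugging Face transformers library follows; the file path and hyperparameters are illustrative, not the authors' exact recipe.

from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2Tokenizer, TextDataset, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# SGF is plain text, e.g. "(;GM[1]SZ[19];B[pd];W[dp];B[qp] ...)"
train_set = TextDataset(tokenizer=tokenizer, file_path="go_games.sgf", block_size=512)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-go", num_train_epochs=3),
    train_dataset=train_set,
    data_collator=collator,
)
trainer.train()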

5. Continual BERT: Continual Learning for Adaptive Extractive Summarization of COVID-19 Literature [PDF] Back to Contents
  Jong Won Park
Abstract: The scientific community continues to publish an overwhelming amount of new research related to COVID-19 on a daily basis, leading to much of the literature receiving little to no attention. To aid the community in understanding the rapidly growing body of COVID-19 literature, we propose a novel BERT architecture that provides a brief yet original summarization of lengthy papers. The model continually learns on new data in an online fashion while minimizing catastrophic forgetting, thus fitting the needs of the community. Benchmarks and manual examination of its performance show that the model provides sound summaries of new scientific literature.
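
One standard way to minimize catastrophic forgetting during such online updates is a quadratic penalty like Elastic Weight Consolidation (EWC); the sketch below shows that general idea only and is not claimed to be the paper's specific mechanism.

import torch

def ewc_penalty(model, fisher, old_params, lam=0.4):
    # Pull parameters toward their values after previous tasks,
    # weighted by an estimate of the diagonal Fisher information.
    loss = torch.zeros(())
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return (lam / 2) * loss

# total_loss = task_loss + ewc_penalty(model, fisher, old_params)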

6. Research on Annotation Rules and Recognition Algorithm Based on Phrase Window [PDF] Back to Contents
  Guang Liu, Gang Tu, Zheng Li, Yi-Jian Liu
Abstract: At present, most Natural Language Processing technology is based on the results of Word Segmentation for Dependency Parsing, which mainly uses an end-to-end method based on supervised learning. There are two main problems with this method: first, the labeling rules are complex and the data is difficult to label, making the workload large; second, the algorithm cannot recognize the multi-granularity and diversity of language components. To solve these two problems, we propose labeling rules based on phrase windows and design corresponding phrase recognition algorithms. The labeling rule uses phrases as the minimum unit, divides sentences into 7 types of nestable phrases, and marks the grammatical dependencies between phrases. The corresponding algorithm, drawing on the idea of identifying target regions in images, can find the start and end positions of each phrase in a sentence and realizes the synchronous recognition of nested phrases and their grammatical dependencies. Experimental results show that the labeling rule is convenient and easy to use, with no ambiguity, and that the algorithm is more grammatically multi-granular and diverse than the end-to-end algorithm. Experiments on the CPWD dataset improve the accuracy of the end-to-end method by about 1 point. The corresponding method was applied to the CCL2018 competition and won first place in the Chinese Metaphor Sentiment Analysis Task.

7. Announcing CzEng 2.0 Parallel Corpus with over 2 Gigawords [PDF] Back to Contents
  Tom Kocmi, Martin Popel, Ondrej Bojar
Abstract: We present a new release of the Czech-English parallel corpus CzEng 2.0 consisting of over 2 billion words (2 "gigawords") in each language. The corpus contains document-level information and is filtered with several techniques to lower the amount of noise. In addition to the data in the previous version of CzEng, it contains new authentic and also high-quality synthetic parallel data. CzEng is freely available for research and educational purposes.

8. Cultural Convergence: Insights into the behavior of misinformation networks on Twitter [PDF] Back to Contents
  Liz McQuillan, Erin McAweeney, Alicia Bargar, Alex Ruch
Abstract: How can the birth and evolution of ideas and communities in a network be studied over time? We use a multimodal pipeline, consisting of network mapping, topic modeling, bridging centrality, and divergence to analyze Twitter data surrounding the COVID-19 pandemic. We use network mapping to detect accounts creating content surrounding COVID-19, then Latent Dirichlet Allocation to extract topics, and bridging centrality to identify topical and non-topical bridges, before examining the distribution of each topic and bridge over time and applying Jensen-Shannon divergence of topic distributions to show communities that are converging in their topical narratives.
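
The convergence measurement reduces to comparing communities' LDA topic mixtures over time. A hedged example of that step with SciPy (whose jensenshannon function returns the square root of the divergence), using made-up numbers:

import numpy as np
from scipy.spatial.distance import jensenshannon

community_a = np.array([0.50, 0.30, 0.15, 0.05])  # topic mixture for one community
community_b = np.array([0.45, 0.35, 0.10, 0.10])  # another community, same week

jsd = jensenshannon(community_a, community_b) ** 2  # square the distance to get divergence
print(f"JS divergence: {jsd:.4f}")  # values shrinking over time => converging narratives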

9. Do Transformers Need Deep Long-Range Memory? [PDF] Back to Contents
  Jack W. Rae, Ali Razavi
Abstract: Deep attention models have advanced the modelling of sequential data across many domains. For language modelling in particular, the Transformer-XL -- a Transformer augmented with a long-range memory of past activations -- has been shown to be state-of-the-art across a variety of well-studied benchmarks. The Transformer-XL incorporates a long-range memory at every layer of the network, which renders its state thousands of times larger than that of its RNN predecessors. However, it is unclear whether this is necessary. We perform a set of interventions to show that comparable performance can be obtained with 6X fewer long-range memories, and that better performance can be obtained by limiting the range of attention in the lower layers of the network.
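
Limiting the range of attention in lower layers amounts to a banded attention mask. A minimal sketch of such a mask, as an illustration of the intervention rather than the authors' implementation:

import torch

def local_attention_mask(seq_len, window):
    # True where attention is allowed: causal and within `window` steps back.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

mask = local_attention_mask(seq_len=8, window=3)
# Lower layers would use a small window; only the upper layers would keep
# the full long-range memory.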

10. Structured (De)composable Representations Trained with Neural Networks [PDF] Back to Contents
  Graham Spinks, Marie-Francine Moens
Abstract: The paper proposes a novel technique for representing templates and instances of concept classes. A template representation refers to the generic representation that captures the characteristics of an entire class. The proposed technique uses end-to-end deep learning to learn structured and composable representations from input images and discrete labels. The obtained representations are based on distance estimates between the distributions given by the class label and those given by contextual information, which are modeled as environments. We prove that the representations have a clear structure allowing to decompose the representation into factors that represent classes and environments. We evaluate our novel technique on classification and retrieval tasks involving different modalities (visual and language data).

11. DAM: Deliberation, Abandon and Memory Networks for Generating Detailed and Non-repetitive Responses in Visual Dialogue [PDF] Back to Contents
  Xiaoze Jiang, Jing Yu, Yajing Sun, Zengchang Qin, Zihao Zhu, Yue Hu, Qi Wu
Abstract: The Visual Dialogue task requires an agent to engage in a conversation with a human about an image. The ability to generate detailed, non-repetitive responses is crucial for the agent to achieve human-like conversation. In this paper, we propose a novel generative decoding architecture to generate high-quality responses, which moves away from decoding the whole encoded semantics towards a design that advocates both transparency and flexibility. In this architecture, word generation is decomposed into a series of attention-based information selection steps, performed by a novel recurrent Deliberation, Abandon and Memory (DAM) module. Each DAM module performs an adaptive combination of the response-level semantics captured from the encoder and the word-level semantics specifically selected for generating each word. Therefore, the responses contain more detailed and non-repetitive descriptions while maintaining semantic accuracy. Furthermore, DAM is flexible enough to cooperate with existing visual dialogue encoders, and adapts to the encoder structure by constraining its information selection mode. We apply DAM to three typical encoders and verify the performance on the VisDial v1.0 dataset. Experimental results show that the proposed models achieve new state-of-the-art performance with high-quality responses. The code is available at this https URL.

12. Labeling of Multilingual Breast MRI Reports [PDF] Back to Contents
  Chen-Han Tsai, Nahum Kiryati, Eli Konen, Miri Sklair-Levy, Arnaldo Mayer
Abstract: Medical reports are an essential medium in recording a patient's condition throughout a clinical trial. They contain valuable information that can be extracted to generate a large labeled dataset needed for the development of clinical tools. However, the majority of medical reports are stored in an unregularized format, and a trained human annotator (typically a doctor) must manually assess and label each case, resulting in an expensive and time-consuming procedure. In this work, we present a framework for developing a multilingual breast MRI report classifier using a custom-built language representation called LAMBR. Our proposed method overcomes practical challenges faced in clinical settings, and we demonstrate improved performance in extracting labels from medical reports when compared with conventional approaches.

13. Deep Contextual Embeddings for Address Classification in E-commerce [PDF] Back to Contents
  Shreyas Mangalgi, Lakshya Kumar, Ravindra Babu Tallamraju
Abstract: E-commerce customers in developing nations like India tend to follow no fixed format when entering shipping addresses. Parsing such addresses is challenging because of the lack of inherent structure or hierarchy. It is imperative to understand the language of addresses so that shipments can be routed without delays. In this paper, we propose a novel approach to understanding customer addresses, drawing motivation from recent advances in Natural Language Processing (NLP). We also formulate different pre-processing steps for addresses using a combination of edit-distance and phonetic algorithms. We then approach the task of creating vector representations for addresses using Word2Vec with TF-IDF, Bi-LSTM, and BERT-based approaches. We compare these approaches with respect to a sub-region classification task for North and South Indian cities. Through experiments, we demonstrate the effectiveness of a generalized RoBERTa model, pre-trained over a large address corpus for the language modelling task. Our proposed RoBERTa model achieves a classification accuracy of around 90% with minimal text preprocessing on the sub-region classification task, outperforming all other approaches. Once pre-trained, the RoBERTa model can be fine-tuned for various downstream tasks in the supply chain, such as pincode suggestion and geo-coding. The model generalizes well to such tasks even with limited labelled data. To the best of our knowledge, this is the first research of its kind to propose a novel approach to understanding customer addresses in the e-commerce domain by pre-training language models and fine-tuning them for different purposes.
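
The RoBERTa fine-tuning step maps directly onto a standard sequence-classification setup. In the sketch below the checkpoint name, label count, and addresses are all hypothetical; the paper pre-trains its own RoBERTa on an address corpus first.

import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=12)

batch = tokenizer(
    ["flat 4b lotus apts mg road bengaluru", "h no 12-3 gandhi nagar hyderabad"],
    padding=True, truncation=True, return_tensors="pt",
)
labels = torch.tensor([3, 7])  # hypothetical sub-region ids
loss = model(**batch, labels=labels).loss  # standard cross-entropy fine-tuning
loss.backward()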

14. Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters [PDF] Back to Contents
  Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert
Abstract: We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages and simplifying the overall deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amounts of training data per language (from 100 hours to 1,100 hours). We compare three variants of multilingual training: a single joint model without knowledge of the input language, a joint model using this information, and a model with multiple heads (one per language cluster). We show that multilingual training of ASR models on several languages can improve recognition performance, in particular on low-resource languages. We observe 20.9%, 23%, and 28.8% average relative WER reduction compared to monolingual baselines for the joint model, the joint model with language input, and the multi-head model, respectively. To our knowledge, this is the first work studying multilingual ASR at massive scale, with more than 50 languages and more than 16,000 hours of audio across them.
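
The multi-head variant can be pictured as a shared encoder with one output projection per language cluster. The sketch below is only a structural illustration; module sizes are tiny placeholders, not the paper's billion-parameter architecture.

import torch.nn as nn

class MultiHeadASR(nn.Module):
    def __init__(self, vocab_sizes, dim=512):
        super().__init__()
        # Shared acoustic encoder across all languages.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8), num_layers=6)
        # One output head per language cluster.
        self.heads = nn.ModuleList(nn.Linear(dim, v) for v in vocab_sizes)

    def forward(self, features, cluster_id):
        return self.heads[cluster_id](self.encoder(features))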
