Abstracts
1. KILT: a Benchmark for Knowledge Intensive Language Tasks [PDF]
Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, Sebastian Riedel
Abstract: Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research on models that condition on specific information in large textual resources, we present a benchmark for knowledge-intensive language tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia, reducing engineering turnaround through the re-use of components, as well as accelerating research into task-agnostic memory architectures. We test both task-specific and general baselines, evaluating downstream performance in addition to the ability of the models to provide provenance. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline, outperforming more tailor-made approaches for fact checking, open-domain question answering and dialogue, and yielding competitive results on entity linking and slot filling, by generating disambiguated text. KILT data and code are available at this https URL.
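The strong baseline couples one shared dense index over the Wikipedia snapshot with a seq2seq generator, so every task reuses the same retrieval component and can report provenance. Below is a minimal numpy sketch of that retrieve-then-generate pattern; the random encoders and toy passages are stand-ins for trained components (the paper's baseline uses a learned dense retriever feeding a seq2seq model), shown only to illustrate the shared-index idea.

```python
import numpy as np

# Toy "index": one dense vector per passage of the shared Wikipedia snapshot.
rng = np.random.default_rng(0)
passages = ["passage on fact checking", "passage on entity linking", "passage on QA"]
index = rng.normal(size=(len(passages), 128))         # stand-in for learned passage encodings
index /= np.linalg.norm(index, axis=1, keepdims=True)

def encode(text: str) -> np.ndarray:
    """Stand-in query encoder; a real system would use a trained bi-encoder."""
    vec = rng.normal(size=128)
    return vec / np.linalg.norm(vec)

def retrieve(query: str, k: int = 2) -> list[tuple[str, float]]:
    """Inner-product search over the single shared index used by every task."""
    scores = index @ encode(query)
    top = np.argsort(-scores)[:k]
    return [(passages[i], float(scores[i])) for i in top]

# The retrieved passages double as provenance; a seq2seq model (not shown)
# would condition on the query plus this text to generate the final output.
print(retrieve("who wrote Hamlet?"))
```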
2. Going Beyond T-SNE: Exposing whatlies in Text Embeddings [PDF]
Vincent D. Warmerdam, Thomas Kober, Rachael Tatman
Abstract: We introduce whatlies, an open source toolkit for visually inspecting word and sentence embeddings. The project offers a unified and extensible API with current support for a range of popular embedding backends including spaCy, tfhub, huggingface transformers, gensim, fastText and BytePair embeddings. The package combines a domain specific language for vector arithmetic with visualisation tools that make exploring word embeddings more intuitive and concise. It offers support for many popular dimensionality reduction techniques as well as many interactive visualisations that can either be statically exported or shared via Jupyter notebooks. The project documentation is available from this https URL.
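As a quick illustration of the DSL described above, here is a usage sketch adapted from the project's documented examples; class and method names (SpacyLanguage, EmbeddingSet indexing, plot_interactive) may differ across package versions, and the spaCy model must be installed separately.

```python
from whatlies.language import SpacyLanguage

# Backend: spaCy vectors (requires `python -m spacy download en_core_web_md`).
lang = SpacyLanguage("en_core_web_md")

# Indexing the backend with a list of tokens yields an EmbeddingSet.
words = ["man", "woman", "king", "queen", "cat", "dog", "puppy", "kitten"]
emb = lang[words]

# The DSL supports vector arithmetic directly on embeddings.
royal = lang["king"] - lang["man"] + lang["woman"]

# Interactive scatter plot along two word axes; renders in a Jupyter notebook
# and can also be exported statically.
emb.plot_interactive(x_axis="man", y_axis="woman")
```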
3. Linguistically inspired morphological inflection with a sequence to sequence model [PDF]
Eleni Metheniti, Guenter Neumann, Josef van Genabith
Abstract: Inflection is an essential part of every human language's morphology, yet little effort has been made to unify linguistic theory and computational methods in recent years. Methods of string manipulation are used to infer inflectional changes; our research question is whether a neural network would be capable of learning inflectional morphemes for inflection production in a similar way to a human in the early stages of language acquisition. We are using an inflectional corpus (Metheniti and Neumann, 2020) and a single-layer seq2seq model to test this hypothesis, in which the inflectional affixes are learned and predicted as a block and the word stem is modelled as a character sequence to account for infixation. Our character-morpheme-based model creates inflection by predicting the stem character-to-character and the inflectional affixes as character blocks. We conducted three experiments on creating an inflected form of a word given the lemma and a set of input and target features, comparing our architecture to a mainstream character-based model with the same hyperparameters, training and test sets. Overall, for 17 languages, we noticed small improvements on inflecting known lemmas (+0.68%), but steadily better performance of our model in predicting inflected forms of unknown words (+3.7%) and small improvements on predicting in a low-resource scenario (+1.09%).
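The core modelling choice, stem as a character sequence and affix as a single predicted block, can be made concrete with a small encoding function. The token format and feature names below are hypothetical illustrations, not the paper's exact scheme.

```python
def encode_example(lemma: str, stem: str, affix: str,
                   features: list[str]) -> tuple[list[str], list[str]]:
    """Hypothetical encoding in the spirit of the paper: morphological features
    are prepended as control tokens, the lemma and stem are character sequences,
    and the inflectional affix is emitted as one block token."""
    source = features + list(lemma)
    target = list(stem) + [f"<AFFIX:{affix}>"]   # affix predicted as a block
    return source, target

src, tgt = encode_example("walk", "walk", "ed", ["V", "PST"])
print(src)  # ['V', 'PST', 'w', 'a', 'l', 'k']
print(tgt)  # ['w', 'a', 'l', 'k', '<AFFIX:ed>']
```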
4. AutoTrans: Automating Transformer Design via Reinforced Architecture Search [PDF]
Wei Zhu, Xiaoling Wang, Xipeng Qiu, Yuan Ni, Guotong Xie
Abstract: Though transformer architectures have shown dominance in many natural language understanding tasks, there are still unsolved issues for the training of transformer models, especially the need for a principled way of warm-up, which has shown importance for stable training of a transformer, as well as whether the task at hand prefers to scale the attention product or not. In this paper, we empirically explore automating the design choices in the transformer model, i.e., how to set layer-norm, whether to scale, number of layers, number of heads, activation function, etc., so that one can obtain a transformer architecture that better suits the tasks at hand. RL is employed to navigate the search space, and special parameter-sharing strategies are designed to accelerate the search. It is shown that sampling a proportion of training data per epoch during the search helps to improve the search quality. Experiments on CoNLL03, Multi-30k, IWSLT14 and WMT-14 show that the searched transformer model can outperform the standard transformers. In particular, we show that our learned model can be trained more robustly with large learning rates without warm-up.
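The abstract enumerates the searched design dimensions; the sketch below spells out what such a search space looks like and where the RL controller would plug in. The dimension names and value ranges are assumptions for illustration, and random sampling stands in for the learned policy.

```python
import random

# Hypothetical search space over the design choices the paper mentions.
SEARCH_SPACE = {
    "norm_position": ["pre_layernorm", "post_layernorm"],
    "scale_attention": [True, False],        # whether to scale q.k by sqrt(d_k)
    "num_layers": [2, 4, 6, 8],
    "num_heads": [4, 8, 16],
    "activation": ["relu", "gelu", "swish"],
}

def sample_architecture(rng: random.Random) -> dict:
    """One architecture = one choice per dimension. The paper's controller is
    an RL policy updated from the reward (dev-set score of the trained
    candidate); uniform sampling here only illustrates the space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

rng = random.Random(42)
candidate = sample_architecture(rng)
print(candidate)
# reward = train_and_evaluate(candidate, epoch_subsample)  # per-epoch data subsampling
# controller.update(candidate, reward)                     # REINFORCE-style update
```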
5. Dynamic Context-guided Capsule Network for Multimodal Machine Translation [PDF]
Huan Lin, Fandong Meng, Jinsong Su, Yongjing Yin, Zhengyuan Yang, Yubin Ge, Jie Zhou, Jiebo Luo
Abstract: Multimodal machine translation (MMT), which mainly focuses on enhancing text-only translation with visual features, has attracted considerable attention from both the computer vision and natural language processing communities. Most current MMT models resort to attention mechanisms, global context modeling or multimodal joint representation learning to utilize visual features. However, the attention mechanism lacks sufficient semantic interactions between modalities, while the other two provide a fixed visual context, which is unsuitable for modeling the observed variability when generating translation. To address the above issues, in this paper, we propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT. Specifically, at each timestep of decoding, we first employ the conventional source-target attention to produce a timestep-specific source-side context vector. Next, DCCN takes this vector as input and uses it to guide the iterative extraction of related visual features via a context-guided dynamic routing mechanism. Particularly, we represent the input image with global and regional visual features, and introduce two parallel DCCNs to model multimodal context vectors with visual features at different granularities. Finally, we obtain two multimodal context vectors, which are fused and incorporated into the decoder for the prediction of the target word. Experimental results on the Multi30K dataset of English-to-German and English-to-French translation demonstrate the superiority of DCCN. Our code is available on this https URL.
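To make the routing idea concrete, here is a rough numpy sketch of dynamic routing in which the agreement signal is modulated by a timestep-specific context vector. It follows the generic capsule-routing recipe (softmax coupling coefficients, squash non-linearity) and is an assumption-laden illustration, not the paper's exact formulation.

```python
import numpy as np

def context_guided_routing(u_hat, context, iters=3):
    """Sketch of context-guided dynamic routing.
    u_hat:   (num_in, num_out, dim) prediction vectors from visual features
    context: (dim,) timestep-specific source-side context from the decoder
    """
    num_in, num_out, dim = u_hat.shape
    logits = np.zeros((num_in, num_out))
    for _ in range(iters):
        c = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # coupling coeffs
        s = (c[:, :, None] * u_hat).sum(axis=0)                         # (num_out, dim)
        norm = np.linalg.norm(s, axis=1, keepdims=True)
        v = (norm**2 / (1 + norm**2)) * s / (norm + 1e-9)               # squash
        # Context guidance: routing logits grow where predictions agree with
        # both the output capsule and the current decoding context.
        logits += (u_hat * (v + context)[None]).sum(axis=2)
    return v

rng = np.random.default_rng(0)
v = context_guided_routing(rng.normal(size=(10, 4, 8)), rng.normal(size=8))
print(v.shape)  # (4, 8) -- multimodal context capsules for this timestep
```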
6. A Comprehensive Analysis of Information Leakage in Deep Transfer Learning [PDF]
Cen Chen, Bingzhe Wu, Minghui Qiu, Li Wang, Jun Zhou
Abstract: Transfer learning is widely used for transferring knowledge from a source domain to a target domain where labeled data is scarce. Recently, deep transfer learning has achieved remarkable progress in various applications. However, in many real-world scenarios the source and target datasets belong to two different organizations, which poses potential privacy issues in deep transfer learning. In this study, to thoroughly analyze the potential privacy leakage in deep transfer learning, we first divide previous methods into three categories. Based on that, we demonstrate specific threats that lead to unintentional privacy leakage in each category. Additionally, we also provide some solutions to prevent these threats. To the best of our knowledge, our study is the first to provide a thorough analysis of the information leakage issues in deep transfer learning methods and provide potential solutions to the issue. Extensive experiments on two public datasets and an industry dataset are conducted to show the privacy leakage under different deep transfer learning settings and the effectiveness of the defense solutions.
7. Data Readiness for Natural Language Processing [PDF]
Fredrik Olsson, Magnus Sahlgren
Abstract: This document concerns data readiness in the context of machine learning and Natural Language Processing. It describes how an organization may proceed to identify, make available, validate, and prepare data to facilitate automated analysis methods. The contents of the document are based on the practical challenges and frequently asked questions we have encountered in our work as an applied research institute helping organizations and companies, in both the public and private sectors, to use data in their business processes.
8. A Hybrid Deep Learning Model for Arabic Text Recognition [PDF]
Mohammad Fasha, Bassam Hammo, Nadim Obeid, Jabir Widian
Abstract: Arabic text recognition is a challenging task because of the cursive nature of the Arabic writing system, its joint writing scheme, the large number of ligatures, and many other challenges. Deep Learning (DL) models have achieved significant progress in numerous domains including computer vision and sequence modelling. This paper presents a model that can recognize Arabic text that was printed using multiple font types, including fonts that mimic Arabic handwritten scripts. The proposed model employs a hybrid DL network that can recognize Arabic printed text without the need for character segmentation. The model was tested on a custom dataset comprised of over two million word samples that were generated using 18 different Arabic font types. The objective of the testing process was to assess the model's capability in recognizing a diverse set of Arabic fonts representing varied cursive styles. The model achieved good results in recognizing characters and words, and it also achieved promising results in recognizing characters when tested on unseen data. The prepared model, the custom datasets and the toolkit for generating similar datasets are made publicly available; these tools can be used to prepare models for recognizing other font types as well as to further extend and enhance the performance of the proposed model.
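The abstract does not spell out the hybrid architecture, but a common segmentation-free design for this task is a CNN feature extractor feeding a recurrent layer trained with CTC, which sidesteps per-character segmentation entirely. The PyTorch sketch below shows that pattern under those assumptions; the paper's actual layers may differ.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """A common CNN+BiLSTM hybrid for segmentation-free text recognition,
    assumed here for illustration; outputs are per-column class logits
    suitable for CTC loss (after log-softmax and time-major transpose)."""
    def __init__(self, num_classes: int, img_height: int = 32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        feat_dim = 128 * (img_height // 4)
        self.rnn = nn.LSTM(feat_dim, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)  # classes include the CTC blank

    def forward(self, x):                      # x: (batch, 1, H, W) line image
        f = self.cnn(x)                        # (batch, C, H/4, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # one step per column
        out, _ = self.rnn(f)
        return self.fc(out)                    # (batch, W/4, num_classes)

model = CRNN(num_classes=100)
logits = model(torch.randn(2, 1, 32, 128))
print(logits.shape)  # torch.Size([2, 32, 100])
```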
9. CoNCRA: A Convolutional Neural Network Code Retrieval Approach [PDF]
Marcelo de Rezende Martins, Marco A. Gerosa
Abstract: Software developers routinely search for code using general-purpose search engines. However, these search engines cannot find code semantically unless it has an accompanying description. We propose a technique for semantic code search: A Convolutional Neural Network approach to code retrieval (CoNCRA). Our technique aims to find the code snippet that most closely matches the developer's intent, expressed in natural language. We evaluated our approach's efficacy on a dataset composed of questions and code snippets collected from Stack Overflow. Our preliminary results showed that our technique, which prioritizes local interactions (words nearby), improved the state-of-the-art (SOTA) by 5% on average, retrieving the most relevant code snippets in the top 3 (three) positions almost 80% of the time. Therefore, our technique is promising and can improve the efficacy of semantic code retrieval.
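A joint-embedding sketch of the approach as described: a convolutional encoder whose small receptive field prioritizes local interactions (nearby words/tokens), with cosine similarity ranking code candidates against the question. The shared encoder, layer sizes, and vocabulary are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SnippetEncoder(nn.Module):
    """Hypothetical encoder: a 1-D convolution captures local n-gram
    interactions; max-pooling yields one fixed-size vector per input."""
    def __init__(self, vocab_size=10000, emb_dim=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)

    def forward(self, ids):                    # ids: (batch, seq_len)
        x = self.emb(ids).transpose(1, 2)      # (batch, emb_dim, seq_len)
        h = F.relu(self.conv(x))               # local window features
        return h.max(dim=2).values             # (batch, hidden)

enc = SnippetEncoder()
q = enc(torch.randint(0, 10000, (1, 12)))      # encoded NL question
c = enc(torch.randint(0, 10000, (5, 40)))      # five candidate code snippets
scores = F.cosine_similarity(q, c)             # rank candidates by similarity
print(scores.topk(3).indices)                  # top-3 positions, as evaluated
```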
10. Multi-Perspective Semantic Information Retrieval [PDF]
Samarth Rawal, Chitta Baral
Abstract: Information Retrieval (IR) is the task of obtaining pieces of data (such as documents or snippets of text) that are relevant to a particular query or need from a large repository of information. While a combination of traditional keyword- and modern BERT-based approaches have been shown to be effective in recent work, there are often nuances in identifying what information is "relevant" to a particular query, which can be difficult to properly capture using these systems. This work introduces the concept of a Multi-Perspective IR system, a novel methodology that combines multiple deep learning and traditional IR models to better predict the relevance of a query-sentence pair, along with a standardized framework for tuning this system. This work is evaluated on the BioASQ Biomedical IR + QA challenges.
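A minimal sketch of the fusion idea, assuming the system combines a traditional keyword scorer with neural scorers through weights tuned on development data; the toy scorers below merely stand in for real BM25-style and BERT-based components.

```python
def multi_perspective_score(query: str, sentence: str, scorers, weights) -> float:
    """Hypothetical fusion: each scorer gives one 'perspective' on
    query-sentence relevance; a weighted sum produces the final score."""
    return sum(w * s(query, sentence) for s, w in zip(scorers, weights))

# Toy perspectives standing in for the real keyword and neural models.
keyword = lambda q, s: len(set(q.lower().split()) & set(s.lower().split()))
length_prior = lambda q, s: 1.0 / (1 + abs(len(s.split()) - 20))

print(multi_perspective_score(
    "what causes influenza", "Influenza is caused by influenza viruses.",
    scorers=[keyword, length_prior], weights=[1.0, 0.5]))
```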