
Contents

1. Contextualized Spoken Word Representations from Convolutional Autoencoders [PDF]
2. DART: Open-Domain Structured Data Record to Text Generation [PDF]
3. Sentiment Polarity Detection on Bengali Book Reviews Using Multinomial Naive Bayes [PDF]
4. Bilingual Dictionary Based Neural Machine Translation without Using Parallel Sentences [PDF]
5. A Broad-Coverage Deep Semantic Lexicon for Verbs [PDF]
6. Learning Spoken Language Representations with Neural Lattice Language Modeling [PDF]
7. Relevance Transformer: Generating Concise Code Snippets with Relevance Feedback [PDF]
8. Reflection-based Word Attribute Transfer [PDF]
9. LMVE at SemEval-2020 Task 4: Commonsense Validation and Explanation using Pretraining Language Model [PDF]
10. CORD19STS: COVID-19 Semantic Textual Similarity Dataset [PDF]
11. Exploratory Analysis of COVID-19 Related Tweets in North America to Inform Public Health Institutes [PDF]
12. Improving Chinese Segmentation-free Word Embedding With Unsupervised Association Measure [PDF]
13. EmotionGIF-Yankee: A Sentiment Classifier with Robust Model Based Ensemble Methods [PDF]
14. Unsupervised Paraphrasing via Deep Reinforcement Learning [PDF]
15. News Sentiment Analysis [PDF]
16. Sentiment Analysis on Customer Responses [PDF]
17. Birds of a Feather Flock Together: Satirical News Detection via Language Model Differentiation [PDF]
18. Sentiment Analysis on Social Media Content [PDF]
19. A Modern Non-SQL Approach to Radiology-Centric Search Engine Design with Clinical Validation [PDF]
20. Pynsett: A programmable relation extractor [PDF]
21. Low Rank Fusion based Transformers for Multimodal Sequences [PDF]
22. Text Data Augmentation: Towards better detection of spear-phishing emails [PDF]

Abstracts
1. Contextualized Spoken Word Representations from Convolutional Autoencoders [PDF]
Prakamya Mishra, Pranav Mathur
Abstract: A lot of work has been done recently to build sound language models for textual data, but not much has been done for speech/audio data. In the case of text, a word can be represented by a unique fixed-length vector. Analogous models for audio data could not only lead to great advances in speech-related natural language processing tasks but could also reduce the need to convert speech to text before performing them. This paper proposes a novel model architecture that produces syntactically and semantically adequate, contextualized representations of varying-length spoken words. The spoken word embeddings generated by the proposed model were validated by (1) inspecting the vector space they form, and (2) evaluating their performance on the downstream task of next-spoken-word prediction.
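The abstract does not spell out the architecture, so the following is only a rough illustration of the general idea: a 1-D convolutional autoencoder in PyTorch whose bottleneck serves as a fixed-length embedding for a variable-length spectrogram. The layer sizes, the 40-band log-mel input, and the pooling trick are assumptions, not the authors' design.

```python
# A minimal sketch (NOT the paper's model): the encoder's pooled bottleneck
# acts as a fixed-length embedding for a varying-length spoken word.
import torch
import torch.nn as nn

class SpokenWordAE(nn.Module):
    def __init__(self, n_mels=40, emb_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(64, emb_dim, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),        # fixed-length code regardless of input length
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(emb_dim, 64, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose1d(64, n_mels, 5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, x):                   # x: (batch, n_mels, frames)
        z = self.encoder(x)                 # (batch, emb_dim, 1)
        z_t = z.expand(-1, -1, x.size(2) // 4)  # stretch the code back along time
        return self.decoder(z_t), z.squeeze(-1)

x = torch.randn(8, 40, 64)                  # toy batch of log-mel spectrograms
recon, emb = SpokenWordAE()(x)
loss = nn.functional.mse_loss(recon, x)     # reconstruction objective
```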
2. DART: Open-Domain Structured Data Record to Text Generation [PDF]
Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Nazneen Fatema Rajani, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia Irwanto, Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Murori Mutuma, Yasin Tarabar, Ankit Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher
Abstract: We introduce DART, a large dataset for open-domain structured data record to text generation. We consider the structured data record input as a set of RDF entity-relation triples, a format widely used for knowledge representation and semantics description. DART consists of 82,191 examples across different domains with each input being a semantic RDF triple set derived from data records in tables and the tree ontology of the schema, annotated with sentence descriptions that cover all facts in the triple set. This hierarchical, structured format with its open-domain nature differentiates DART from other existing table-to-text corpora. We conduct an analysis of DART on several state-of-the-art text generation models, showing that it introduces new and interesting challenges compared to existing datasets. Furthermore, we demonstrate that finetuning pretrained language models on DART facilitates out-of-domain generalization on the WebNLG 2017 dataset. DART is available at this https URL.
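To make the input/output format concrete, here is an invented record in the spirit of DART, not an actual dataset entry: a set of subject-predicate-object RDF triples paired with a sentence covering every fact in the set.

```python
# Illustrative only; field names and content are assumptions, not DART's schema.
record = {
    "tripleset": [
        ("Alan Turing", "birthPlace", "London"),
        ("Alan Turing", "field", "Computer Science"),
    ],
    "target": "Alan Turing, who was born in London, worked in computer science.",
}
```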
3. Sentiment Polarity Detection on Bengali Book Reviews Using Multinomial Naive Bayes [PDF]
Eftekhar Hossain, Omar Sharif, Mohammed Moshiul Hoque
Abstract: Recently, sentiment polarity detection has received increased attention from NLP researchers due to the massive availability of customer opinions and reviews on online platforms. With the continued expansion of e-commerce sites, the rate of purchase of various products, including books, is growing enormously. Readers' opinions and reviews affect the buying decisions of most customers. This work introduces a machine learning-based technique to determine sentiment polarity (positive or negative) from Bengali book reviews. To assess the effectiveness of the proposed technique, a corpus of 2,000 Bengali book reviews was developed. A comparative analysis of various approaches (such as logistic regression, naive Bayes, SVM, and SGD) was also performed, taking unigram, bigram, and trigram features into consideration. Experimental results reveal that multinomial Naive Bayes with unigram features outperforms the other techniques, with 84% accuracy on the test set.
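The winning configuration reported above is easy to reproduce in outline. The sketch below wires unigram counts into a multinomial Naive Bayes classifier with scikit-learn; the two toy reviews merely stand in for the 2,000-review corpus, which is not reproduced here.

```python
# Unigram counts + multinomial Naive Bayes, the best setup per the abstract.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = ["চমৎকার বই", "খুব খারাপ বই"]      # toy positive / negative Bengali reviews
labels = ["positive", "negative"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),    # unigram features
    MultinomialNB(),
)
model.fit(reviews, labels)
print(model.predict(["চমৎকার"]))            # -> ['positive']
```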
4. Bilingual Dictionary Based Neural Machine Translation without Using Parallel Sentences [PDF]
Xiangyu Duan, Baijun Ji, Hao Jia, Min Tan, Min Zhang, Boxing Chen, Weihua Luo, Yue Zhang
Abstract: In this paper, we propose a new machine translation (MT) task that uses no parallel sentences but may refer to a ground-truth bilingual dictionary. Motivated by a monolingual speaker's ability to learn to translate by looking up a bilingual dictionary, we propose the task to see how much potential an MT system can attain using a bilingual dictionary and large-scale monolingual corpora while remaining independent of parallel sentences. We propose anchored training (AT) to tackle the task. AT uses the bilingual dictionary to establish anchoring points that close the gap between the source and target languages. Experiments on various language pairs show that our approaches are significantly better than various baselines, including dictionary-based word-by-word translation, dictionary-supervised cross-lingual word embedding transformation, and unsupervised MT. On distant language pairs, where unsupervised MT struggles to perform well, AT performs remarkably better, achieving performance comparable to supervised SMT trained on more than 4M parallel sentences.
5. A Broad-Coverage Deep Semantic Lexicon for Verbs [PDF]
James Allen, Hannah An, Ritwik Bose, Will de Beaumont, Choh Man Teng
Abstract: Progress on deep language understanding is inhibited by the lack of a broad coverage lexicon that connects linguistic behavior to ontological concepts and axioms. We have developed COLLIE-V, a deep lexical resource for verbs, with the coverage of WordNet and syntactic and semantic details that meet or exceed existing resources. Bootstrapping from a hand-built lexicon and ontology, new ontological concepts and lexical entries, together with semantic role preferences and entailment axioms, are automatically derived by combining multiple constraints from parsing dictionary definitions and examples. We evaluated the accuracy of the technique along a number of different dimensions and were able to obtain high accuracy in deriving new concepts and lexical entries. COLLIE-V is publicly available.
6. Learning Spoken Language Representations with Neural Lattice Language Modeling [PDF]
Chao-Wei Huang, Yun-Nung Chen
Abstract: Pre-trained language models have achieved huge improvement on many NLP tasks. However, these methods are usually designed for written text, so they do not consider the properties of spoken language. Therefore, this paper aims at generalizing the idea of language model pre-training to lattices generated by recognition systems. We propose a framework that trains neural lattice language models to provide contextualized representations for spoken language understanding tasks. The proposed two-stage pre-training approach reduces the demands of speech data and has better efficiency. Experiments on intent detection and dialogue act recognition datasets demonstrate that our proposed method consistently outperforms strong baselines when evaluated on spoken inputs. The code is available at this https URL.
7. Relevance Transformer: Generating Concise Code Snippets with Relevance Feedback [PDF]
Carlos Gemmell, Federico Rossetto, Jeffrey Dalton
Abstract: Tools capable of automatic code generation have the potential to augment programmer's capabilities. While straightforward code retrieval is incorporated into many IDEs, an emerging area is explicit code generation. Code generation is currently approached as a Machine Translation task, with Recurrent Neural Network (RNN) based encoder-decoder architectures trained on code-description pairs. In this work we introduce and study modern Transformer architectures for this task. We further propose a new model called the Relevance Transformer that incorporates external knowledge using pseudo-relevance feedback. The Relevance Transformer biases the decoding process to be similar to existing retrieved code while enforcing diversity. We perform experiments on multiple standard benchmark datasets for code generation including Django, Hearthstone, and CoNaLa. The results show improvements over state-of-the-art methods based on BLEU evaluation. The Relevance Transformer model shows the potential of Transformer-based architectures for code generation and introduces a method of incorporating pseudo-relevance feedback during inference.
8. Reflection-based Word Attribute Transfer [PDF]
Yoichi Ishibashi, Katsuhito Sudoh, Koichiro Yoshino, Satoshi Nakamura
Abstract: Word embeddings, which often represent such analogic relations as king - man + woman = queen, can be used to change a word's attribute, including its gender. For transferring king into queen in this analogy-based manner, we subtract a difference vector man - woman based on the knowledge that king is male. However, developing such knowledge is very costly for words and attributes. In this work, we propose a novel method for word attribute transfer based on reflection mappings without such an analogy operation. Experimental results show that our proposed method can transfer the word attributes of the given words without changing the words that do not have the target attributes.
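For readers unfamiliar with the analogy-based transfer the paper improves on, the toy sketch below computes king - man + woman and recovers queen by cosine similarity. The vectors are made up; the paper's reflection mapping, which avoids needing the difference vector and the knowledge that king is male, is not reproduced here.

```python
# Analogy-based attribute transfer on made-up 3-d embeddings.
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "apple": np.array([0.1, 0.9, 0.5]),     # distractor
}

def nearest(v, exclude):
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], v))

v = emb["king"] - emb["man"] + emb["woman"]  # transfer the gender attribute
print(nearest(v, exclude={"king", "man", "woman"}))  # -> queen
```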
9. LMVE at SemEval-2020 Task 4: Commonsense Validation and Explanation using Pretraining Language Model [PDF]
Shilei Liu, Yu Guo, Bochao Li, Feiliang Ren
Abstract: This paper describes our submission to subtasks a and b of SemEval-2020 Task 4. For subtask a, we use an ALBERT-based model with an improved input form to pick out the common-sense statement from two candidate statements. For subtask b, we use a multiple-choice model, enhanced by a hint-sentence mechanism, to select from the given options the reason why a statement is against common sense. Besides, we propose a novel transfer learning strategy between the subtasks, which helps improve performance. The accuracy scores of our system are 95.6 / 94.9 on the official test set, ranking 7$^{th}$ / 2$^{nd}$ on the post-evaluation leaderboard.
10. CORD19STS: COVID-19 Semantic Textual Similarity Dataset [PDF]
Xiao Guo, Hengameh Mirzaalian, Ekraam Sabir, Aysush Jaiswal, Wael Abd-Almageed
Abstract: In order to combat the COVID-19 pandemic, society can benefit from various natural language processing applications, such as dialog medical diagnosis systems and information retrieval engines calibrated specifically for COVID-19. These applications rely on the ability to measure semantic textual similarity (STS), making STS a fundamental task that can benefit several downstream applications. However, existing STS datasets and models fail to translate their performance to a domain-specific environment such as COVID-19. To overcome this gap, we introduce the CORD19STS dataset, which includes 13,710 annotated sentence pairs collected from the COVID-19 open research dataset (CORD-19) challenge. To be specific, we generated one million sentence pairs using different sampling strategies. We then used a finetuned BERT-like language model, which we call Sen-SCI-CORD19-BERT, to calculate the similarity scores between sentence pairs to provide a balanced dataset with respect to the different semantic similarity levels, which gives us a total of 32K sentence pairs. Each sentence pair was annotated by five Amazon Mechanical Turk (AMT) crowd workers, where the labels represent different semantic similarity levels between the sentence pairs (i.e. related, somewhat-related, and not-related). After employing rigorous qualification tasks to verify the collected annotations, our final CORD19STS dataset includes 13,710 sentence pairs.
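The authors' Sen-SCI-CORD19-BERT checkpoint is not assumed to be public here, so the sketch below uses a generic sentence encoder to show the scoring step: embed each pair and take the cosine similarity, which can then be binned into similarity levels for balancing.

```python
# Pair scoring with a stand-in sentence encoder (not the paper's model).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
pairs = [
    ("The virus spreads via droplets.", "Droplet transmission drives spread."),
    ("The virus spreads via droplets.", "Masks are made of cotton."),
]
emb_a = model.encode([a for a, _ in pairs])
emb_b = model.encode([b for _, b in pairs])
for (a, b), s in zip(pairs, util.cos_sim(emb_a, emb_b).diagonal()):
    print(f"{s:.2f}  {a} | {b}")
```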
11. Exploratory Analysis of COVID-19 Related Tweets in North America to Inform Public Health Institutes [PDF]
Hyeju Jang, Emily Rempel, Giuseppe Carenini, Naveed Janjua
Abstract: Social media is a rich source where we can learn about people's reactions to social issues. As COVID-19 has significantly impacted people's lives, it is essential to capture how people react to public health interventions and understand their concerns. In this paper, we aim to investigate people's reactions to and concerns about COVID-19 in North America, especially focusing on Canada. We analyze COVID-19 related tweets using topic modeling and aspect-based sentiment analysis, and interpret the results with public health experts. We compare the timeline of topics discussed with the timing of implementation of public health interventions for COVID-19. We also examine people's sentiment about COVID-19 related issues. We discuss how the results can be helpful for public health agencies when designing a policy for new interventions. Our work shows how Natural Language Processing (NLP) techniques could be applied to public health questions with domain expert involvement.
12. Improving Chinese Segmentation-free Word Embedding With Unsupervised Association Measure [PDF]
Yifan Zhang, Maohua Wang, Yongjian Huang, Qianrong Gu
Abstract: Recent work on segmentation-free word embedding (sembei) developed a new word-embedding pipeline for unsegmented languages that avoids segmentation as a preprocessing step. However, the many noisy n-grams in the embedding vocabulary that lack strong association strength between characters limit the quality of the learned word embeddings. To deal with this problem, a new version of the segmentation-free word embedding model is proposed that collects the n-gram vocabulary via a novel unsupervised association measure called pointwise association with times information (PATI). Compared with commonly used n-gram filtering criteria, such as the frequency used in sembei and pointwise mutual information (PMI), the proposed measure leverages more latent information from the corpus and is thus able to collect more valid n-grams with stronger cohesion as embedding targets in unsegmented language data, such as Chinese texts. Further experiments on Chinese SNS data show that the proposed model improves the performance of the word embeddings in downstream tasks.
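PATI is the paper's own measure and its definition is not reproduced here; for orientation, the sketch below computes the classical PMI baseline it is compared against, over adjacent character pairs of a toy unsegmented Chinese string: PMI(x, y) = log(p(x, y) / (p(x) p(y))).

```python
# PMI over adjacent character bigrams of unsegmented text (baseline, not PATI).
import math
from collections import Counter

text = "自然语言处理很自然"                   # toy unsegmented Chinese text
chars = Counter(text)
bigrams = Counter(zip(text, text[1:]))
n_uni, n_bi = sum(chars.values()), sum(bigrams.values())

def pmi(x, y):
    p_xy = bigrams[(x, y)] / n_bi
    return math.log(p_xy / ((chars[x] / n_uni) * (chars[y] / n_uni)))

print(pmi("自", "然"))                       # strongly associated pair -> positive PMI
```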
13. EmotionGIF-Yankee: A Sentiment Classifier with Robust Model Based Ensemble Methods [PDF]
Wei-Yao Wang, Kai-Shiang Chang, Yu-Chien Tang
Abstract: This paper provides a method to classify sentiment with robust model-based ensemble methods. We preprocess tweet data to enhance the coverage of the tokenizer. To reduce domain bias, we first train the pre-trained language model on a tweet dataset. Besides, since each classifier has its own strengths and weaknesses, we leverage different types of models with two ensemble methods: average and power weighted sum. The experiments show that our approach achieves a positive effect on sentiment classification. Our system reached third place among 26 teams in the evaluation of the SocialNLP 2020 EmotionGIF competition.
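The abstract names two combination rules, average and power weighted sum. The sketch below shows one plausible reading of each over the class probabilities of three classifiers; the exponent and the normalization are assumptions, as the paper's exact formulation is not given in the abstract.

```python
import numpy as np

# class-probability outputs of three classifiers for one tweet
probs = np.array([[0.6, 0.4],
                  [0.8, 0.2],
                  [0.2, 0.8]])

avg = probs.mean(axis=0)                    # simple average

alpha = 2.0                                 # assumed exponent; rewards confident votes
powered = probs ** alpha
power_weighted = powered.sum(axis=0) / powered.sum()  # power weighted sum

print(avg.argmax(), power_weighted.argmax())
```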
14. Unsupervised Paraphrasing via Deep Reinforcement Learning [PDF]
A. B. Siddique, Samet Oymak, Vagelis Hristidis
Abstract: Paraphrasing is expressing the meaning of an input sentence in different wording while maintaining fluency (i.e., grammatical and syntactical correctness). Most existing work on paraphrasing uses supervised models that are limited to specific domains (e.g., image captions). Such models can neither be straightforwardly transferred to other domains nor generalize well, and creating labeled training data for new domains is expensive and laborious. The need for paraphrasing across different domains and the scarcity of labeled training data in many such domains call for exploring unsupervised paraphrase generation methods. We propose Progressive Unsupervised Paraphrasing (PUP): a novel unsupervised paraphrase generation method based on deep reinforcement learning (DRL). PUP uses a variational autoencoder (trained using a non-parallel corpus) to generate a seed paraphrase that warm-starts the DRL model. Then, PUP progressively tunes the seed paraphrase guided by our novel reward function, which combines semantic adequacy, language fluency, and expression diversity measures to quantify the quality of the generated paraphrases in each iteration without needing parallel sentences. Our extensive experimental evaluation shows that PUP outperforms unsupervised state-of-the-art paraphrasing techniques in terms of both automatic metrics and user studies on four real datasets. We also show that PUP outperforms domain-adapted supervised algorithms on several datasets. Our evaluation also shows that PUP achieves a great trade-off between semantic similarity and diversity of expression.
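The reward function is the heart of the method. Below is a loose sketch of a reward of the kind described, combining adequacy, fluency, and diversity terms into one scalar; the placeholder scorers, the Jaccard-style diversity term, and the weights are ours, not the paper's formulation.

```python
# A hypothetical PUP-style reward: weighted adequacy + fluency + diversity.
def reward(source, paraphrase, adequacy, fluency, w=(0.4, 0.3, 0.3)):
    sem = adequacy(source, paraphrase)      # e.g. embedding cosine in [0, 1]
    flu = fluency(paraphrase)               # e.g. normalized LM probability
    src, par = set(source.lower().split()), set(paraphrase.lower().split())
    div = 1 - len(src & par) / max(len(src | par), 1)  # expression diversity
    return w[0] * sem + w[1] * flu + w[2] * div

# toy scorers so the sketch runs end to end
print(reward("the cat sat", "a cat was sitting",
             adequacy=lambda s, p: 0.9, fluency=lambda p: 0.8))
```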
15. News Sentiment Analysis [PDF]
Antony Samuels, John Mcgonical
Abstract: The modern technological era has reshaped traditional lifestyles in several domains. The medium of publishing news and events has become faster with the advancement of information technology. It has also been flooded with immense amounts of data, published every minute of every day by millions of users in the form of comments, blogs, news shared through blogs, social media, micro-blogging websites, and many more. Manual traversal of such huge data is a challenging job; thus, sophisticated methods are required to perform this task automatically and efficiently. News reports cover events that carry emotions: good, bad, or neutral. Sentiment analysis is utilized to investigate the human emotions present in textual information. This paper presents a lexicon-based approach for sentiment analysis of news articles. The experiments were performed on a BBC news data set, which demonstrates the applicability and validity of the adopted approach.
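A lexicon-based scorer can be stated in a few lines: sum the polarities of the words that appear in a sentiment lexicon. The tiny lexicon below is obviously a stand-in; a real system would use a full resource and handle negation and intensifiers.

```python
# Minimal lexicon-based sentiment scoring (illustrative lexicon).
LEXICON = {"gain": 1, "growth": 1, "crisis": -1, "loss": -1}

def sentiment(text):
    score = sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in text.split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("Strong growth and a clear gain offset the loss."))  # -> positive
```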
16. Sentiment Analysis on Customer Responses [PDF]
Antony Samuels, John Mcgonical
Abstract: Sentiment analysis is one of the fastest-spreading research areas in computer science, making it challenging to keep track of all the activity in the area. We present an analysis of customer feedback reviews on products, where we utilize opinion mining, text mining, and sentiment analysis, which affect the surrounding world by changing opinions about specific products. The data used in this study are online product reviews collected from this http URL. We performed a comparative sentiment analysis of the retrieved reviews. This research paper provides a sentiment analysis of various opinions on smart phones, dividing them into positive, negative, and neutral.
17. Birds of a Feather Flock Together: Satirical News Detection via Language Model Differentiation [PDF]
Yigeng Zhang, Fan Yang, Yifan Zhang, Eduard Dragut, Arjun Mukherjee
Abstract: Satirical news is regularly shared in modern social media because it is entertaining with smartly embedded humor. However, it can be harmful to society because it can sometimes be mistaken as factual news, due to its deceptive character. We found that in satirical news, the lexical and pragmatic attributes of the context are the key factors in amusing the readers. In this work, we propose a method that differentiates satirical news from true news. It takes advantage of satirical writing evidence by leveraging the difference between the prediction loss of two language models, one trained on true news and the other on satirical news, when given a new news article. We compute several statistical metrics of language model prediction loss as features, which are then used to conduct downstream classification. The proposed method is computationally effective because the language models capture the language-usage differences between satirical news documents and traditional news documents, and are sensitive when applied to documents outside their domains.
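As a self-contained stand-in for the paper's two neural language models, the sketch below trains two add-one-smoothed unigram models, one on "true" text and one on "satirical" text, and uses their per-word loss difference as a feature, mirroring the differentiation idea at toy scale.

```python
# Language-model differentiation with unigram LMs standing in for neural LMs.
import math
from collections import Counter

def unigram_lm(corpus):
    counts = Counter(w for doc in corpus for w in doc.split())
    total, vocab = sum(counts.values()), len(counts) + 1
    def loss(doc):                          # mean negative log-likelihood
        words = doc.split()
        nll = -sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
        return nll / max(len(words), 1)
    return loss

true_lm = unigram_lm(["officials confirmed the budget today"])
satire_lm = unigram_lm(["local man declares himself a national budget"])

doc = "man declares budget a triumph"
print(true_lm(doc) - satire_lm(doc))        # positive -> reads more satirical
```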
18. Sentiment Analysis on Social Media Content [PDF]
Antony Samuels, John Mcgonicle
Abstract: Nowadays, people from all around the world use social media sites to share information. Twitter, for example, is a platform on which users send and read posts known as tweets and interact with different communities. Users share their daily lives and post their opinions on everything, such as brands and places. Companies can benefit from this massive platform by collecting data related to opinions about them. The aim of this paper is to present a model that can perform sentiment analysis of real data collected from Twitter. Data on Twitter are highly unstructured, which makes them difficult to analyze. However, our proposed model differs from prior work in this field in that it combines the use of supervised and unsupervised machine learning algorithms. The process of performing sentiment analysis is as follows: tweets are extracted directly from the Twitter API, then data cleaning and exploration are performed. After that, the data are fed into several models for training, and each extracted tweet is classified as positive, negative, or neutral based on its sentiment. Data were collected on two subjects, McDonald's and KFC, to show which restaurant is more popular. Different machine learning algorithms were used, and the results from these models were evaluated using various testing metrics such as cross-validation and F-score. Moreover, our model demonstrates strong performance on mining texts extracted directly from Twitter.
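The cleaning step mentioned above typically strips URLs, mentions, and hashtag markers before tokenization; a toy version might look like the following (real pipelines normalize much more, e.g. emoji, elongation, and casing).

```python
# Toy tweet cleaning: drop URLs and @mentions, keep hashtag words without '#'.
import re

def clean(tweet):
    tweet = re.sub(r"https?://\S+|@\w+|#", " ", tweet)
    return re.sub(r"\s+", " ", tweet).strip().lower()

print(clean("Loving the fries @McDonalds!! #fastfood https://t.co/x1"))
# -> "loving the fries !! fastfood"
```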
19. A Modern Non-SQL Approach to Radiology-Centric Search Engine Design with Clinical Validation [PDF]
Ningcheng Li, Guy Maresh, Maxwell Cretcher, Khashayar Farsad, Ramsey Al-Hakim, John Kaufman, Judy Gichoya
Abstract: Healthcare data is increasing in size at an unprecedented speed, with much attention on big data analysis and artificial intelligence applications for quality assurance, clinical training, severity triaging, and decision support. Radiology is well-suited for innovation given its intrinsically paired linguistic and visual data. Previous attempts to unlock this information goldmine were encumbered by the heterogeneity of human language, proprietary search algorithms, and a lack of medicine-specific search performance metrics. We present a de novo process of developing a document-based, secure, efficient, and accurate search engine in the context of radiology. We assess our implementation of the search engine by comparison to pre-existing, manually collected clinical databases previously used for clinical research projects, in addition to computational performance benchmarks and survey feedback. By leveraging efficient database architecture, search capability, and clinical thinking, radiologists are at the forefront of harnessing the power of healthcare data.
20. Pynsett: A programmable relation extractor [PDF]
Alberto Cetoli
Abstract: This paper proposes a programmable relation extraction method for the English language by parsing texts into semantic graphs. A person can define rules in plain English that act as matching patterns onto the graph representation. These rules are designed to capture the semantic content of the documents, allowing for flexibility and ad-hoc entities. Relation extraction is a complex task that typically requires sizeable training corpora. The method proposed here is ideal for extracting specialized ontologies in a limited collection of documents.
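Pynsett's actual rule syntax is not reproduced here; the generic sketch below only illustrates the underlying idea of a wildcard pattern matched against a semantic graph stored as subject-relation-object triples.

```python
# A toy graph matcher: '?'-prefixed slots are variables; only the relation
# is matched exactly. This is an illustration, not Pynsett's API.
graph = [("Jane", "works_for", "Acme"), ("Acme", "based_in", "London")]

def match(pattern, triples):
    subj, rel, obj = pattern
    return [{subj: s, obj: o} for s, r, o in triples if r == rel]

print(match(("?person", "works_for", "?company"), graph))
# -> [{'?person': 'Jane', '?company': 'Acme'}]
```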
21. Low Rank Fusion based Transformers for Multimodal Sequences [PDF]
Saurav Sahay, Eda Okur, Shachi H Kumar, Lama Nachman
Abstract: Our senses individually work in a coordinated fashion to express our emotional intentions. In this work, we experiment with modeling modality-specific sensory signals to attend to our latent multimodal emotional intentions, and vice versa, expressed via low-rank multimodal fusion and multimodal transformers. The low-rank factorization of multimodal fusion amongst the modalities helps represent approximate multiplicative latent signal interactions. Motivated by the work of Tsai et al. (2019) and Liu et al. (2018), we present our transformer-based cross-fusion architecture without any over-parameterization of the model. The low-rank fusion helps represent the latent signal interactions while the modality-specific attention helps focus on relevant parts of the signal. We present two methods for multimodal sentiment and emotion recognition on the CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets and show that our models have fewer parameters, train faster, and perform comparably to many larger fusion-based architectures.
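The low-rank fusion the abstract builds on (Liu et al., 2018) admits a compact sketch: append a bias 1 to each modality vector, project it with rank-many factors, multiply the projections elementwise across modalities, and sum over the rank. The dimensions below are arbitrary placeholders.

```python
# Low-rank multimodal fusion (LMF-style) in numpy; sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_out, rank = 64, 4
x = {"text": rng.normal(size=300), "audio": rng.normal(size=74)}
W = {m: rng.normal(size=(rank, v.size + 1, d_out)) * 0.01 for m, v in x.items()}

def lmf(x, W):
    fused = None
    for m, v in x.items():
        v1 = np.append(v, 1.0)                   # append bias term
        proj = np.einsum("i,rio->ro", v1, W[m])  # (rank, d_out) per modality
        fused = proj if fused is None else fused * proj  # multiplicative interaction
    return fused.sum(axis=0)                     # sum over rank

print(lmf(x, W).shape)                           # -> (64,)
```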
22. Text Data Augmentation: Towards better detection of spear-phishing emails [PDF] 返回目录
Mehdi Regina, Maxime Meyer, Sébastien Goutal
Abstract: Text data augmentation, i.e. the creation of synthetic textual data from an original text, is challenging as augmentation transformations should take into account language complexity while being relevant to the target Natural Language Processing (NLP) task (e.g. Machine Translation, Question Answering, Text Classification, etc.). Motivated by a business application of Business Email Compromise (BEC) detection, we propose a corpus- and task-agnostic text augmentation framework combining different methods, utilizing a BERT language model, multi-step back-translation, and heuristics. We show that our augmentation framework improves performance on several text classification tasks using publicly available models and corpora (SST2 and TREC) as well as on a BEC detection task. We also provide a comprehensive discussion of the limitations of our augmentation framework.
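One of the component methods the abstract names, masked-language-model substitution, can be sketched with an off-the-shelf fill-mask pipeline. This is a generic illustration rather than the authors' framework; the model choice and the one-word masking policy are assumptions:

```python
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def mlm_substitute(sentence, rng=random.Random(0)):
    """Mask one random word and let BERT propose in-context replacements."""
    words = sentence.split()
    i = rng.randrange(len(words))
    masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
    # Each candidate is a dict with the filled-in 'sequence' and a 'score'.
    return [c["sequence"] for c in fill_mask(masked)[:3]]

print(mlm_substitute("please review the attached invoice before friday"))
```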
23. Robust Prediction of Punctuation and Truecasing for Medical ASR [PDF] 返回目录
Monica Sunkara, Srikanth Ronanki, Kalpit Dixit, Sravan Bodapati, Katrin Kirchhoff
Abstract: Automatic speech recognition (ASR) systems in the medical domain that focus on transcribing clinical dictations and doctor-patient conversations face many challenges due to the complexity of the domain. ASR output typically undergoes automatic punctuation to enable users to speak naturally, without having to vocalise awkward and explicit punctuation commands, such as "period", "add comma" or "exclamation point", while truecasing enhances user readability and improves the performance of downstream NLP tasks. This paper proposes a conditional joint modeling framework for prediction of punctuation and truecasing using pretrained masked language models such as BERT, BioBERT and RoBERTa. We also present techniques for domain- and task-specific adaptation by fine-tuning masked language models with medical domain data. Finally, we improve the robustness of the model against common errors made in ASR by performing data augmentation. Experiments performed on dictation and conversational style corpora show that our proposed model achieves ~5% absolute improvement on ground truth text and ~10% improvement on ASR outputs over baseline models under the F1 metric.
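A minimal way to picture the joint label space is to tag each lowercase, unpunctuated ASR token with a (punctuation-after, casing) pair and then apply the labels as post-processing. The label names and example data below are illustrative, not the paper's exact scheme:

```python
# Each ASR token gets a joint (punctuation-after, casing) label.
tokens = ["the", "patient", "denies", "chest", "pain",
          "follow", "up", "in", "two", "weeks"]
labels = [("O", "CAP"), ("O", "LOWER"), ("O", "LOWER"), ("O", "LOWER"),
          ("PERIOD", "LOWER"), ("O", "CAP"), ("O", "LOWER"), ("O", "LOWER"),
          ("O", "LOWER"), ("PERIOD", "LOWER")]

PUNCT = {"O": "", "PERIOD": ".", "COMMA": ","}

def apply_labels(tokens, labels):
    # Restore casing and punctuation from the predicted joint labels.
    out = []
    for tok, (punct, case) in zip(tokens, labels):
        out.append((tok.capitalize() if case == "CAP" else tok) + PUNCT[punct])
    return " ".join(out)

print(apply_labels(tokens, labels))
# The patient denies chest pain. Follow up in two weeks.
```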
24. El Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks [PDF] 返回目录
Maria Khvalchik, Mikhail Galkin
Abstract: Pre-training large-scale language models (LMs) requires huge amounts of text corpora. LMs for English enjoy ever growing corpora of diverse language resources. However, less resourced languages and their mono- and multilingual LMs often struggle to obtain bigger datasets. A typical approach in this case implies using machine translation of English corpora to a target language. In this work, we study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks and demonstrate that careful curation along with post-processing lead to improved performance and overall LMs robustness. In the empirical evaluation, we perform a comparison of directly translated against curated Spanish SQuAD datasets on both user and system levels. Further experimental results on XQuAD and MLQA transfer-learning evaluation question answering tasks show that presumably multilingual LMs exhibit more resilience to machine translation artifacts in terms of the exact match score.
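One standard curation step for machine-translated QA data, and plausibly part of the post-processing the authors describe although the abstract does not spell it out, is dropping examples whose translated answer no longer appears verbatim in the translated context, then realigning the answer offset:

```python
def curate(examples):
    """Keep only QA pairs whose answer string survives translation intact,
    and realign the answer start offset in the translated context."""
    kept = []
    for ex in examples:
        start = ex["context"].find(ex["answer"])
        if start != -1:  # answer is still a span of the context
            kept.append({**ex, "answer_start": start})
    return kept

examples = [  # invented, minimal stand-ins for translated SQuAD records
    {"context": "El departamento fue fundado en 1891.",
     "question": "¿Cuándo?", "answer": "1891"},
    {"context": "La universidad está en California.",
     "question": "¿Dónde?", "answer": "Stanford"},
]
print(len(curate(examples)))  # 1 -- the second answer no longer matches
```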
25. Abstractive and mixed summarization for long-single documents [PDF] 返回目录
Roger Barrull, Jugal Kalita
Abstract: The lack of diversity in the datasets available for automatic summarization of documents has meant that the vast majority of neural models for automatic summarization have been trained with news articles. These datasets are relatively small, with an average size of about 600 words, and the models trained with such data sets see their performance limited to short documents. In order to surmount this problem, this paper uses scientific papers as the dataset on which different models are trained. These models have been chosen based on their performance on the CNN/Daily Mail data set, so that the highest ranked model of each architectural variant is selected. In this work, six different models are compared: two with an RNN architecture, one with a CNN architecture, two with a Transformer architecture, and one with a Transformer architecture combined with reinforcement learning. The results from this work show that models that use a hierarchical encoder to capture the structure of the document perform better than the rest.
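The hierarchy that the closing claim refers to can be sketched in a few lines: encode each sentence to a vector, then encode the sequence of sentence vectors. In this sketch, plain averages stand in for the learned word- and sentence-level encoders, which is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_vecs = {}  # toy word embeddings, created on demand

def embed(word):
    if word not in vocab_vecs:
        vocab_vecs[word] = rng.normal(size=8)
    return vocab_vecs[word]

def encode_sentence(sentence):
    # Word level: a real model would use an RNN/Transformer, not a mean.
    return np.mean([embed(w) for w in sentence.split()], axis=0)

def encode_document(sentences):
    # Sentence level: a second encoder over sentence vectors captures
    # document structure that a flat word sequence loses.
    return np.mean([encode_sentence(s) for s in sentences], axis=0)

doc = ["we study long document summarization",
       "hierarchical encoders model sections and sentences",
       "flat encoders truncate long inputs"]
print(encode_document(doc).shape)  # (8,)
```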
26. On the Evolution of Programming Languages [PDF] 返回目录
K. R. Chowdhary
Abstract: This paper attempts to connect the evolution of computer languages with the evolution of life, where the latter has been dictated by the \emph{theory of evolution of species}, and tries to give supporting evidence that new languages are more robust than their predecessors, carrying over a mix of features from older languages such that strong features get added and weak features of older languages get removed. In addition, an analysis of the most prominent programming languages is presented, emphasizing how the features of existing languages have influenced the development of new programming languages. At the end, it suggests a set of experimental languages, which may rule the world of programming languages in the era of new multi-core architectures. Index terms: programming language evolution, classification of languages, future languages, scripting languages.
27. Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training [PDF] 返回目录
Yingwei Pan, Yehao Li, Jianjie Luo, Jun Xu, Ting Yao, Tao Mei
Abstract: In this work, we present Auto-captions on GIF, which is a new large-scale pre-training dataset for generic video understanding. All video-sentence pairs are created by automatically extracting and filtering video caption annotations from billions of web pages. Auto-captions on GIF dataset can be utilized to pre-train the generic feature representation or encoder-decoder structure for video captioning, and other downstream tasks (e.g., sentence localization in videos, video question answering, etc.) as well. We present a detailed analysis of Auto-captions on GIF dataset in comparison to existing video-sentence datasets. We also provide an evaluation of a Transformer-based encoder-decoder structure for vision-language pre-training, which is further adapted to video captioning downstream task and yields the compelling generalizability on MSR-VTT. The dataset is available at \url{http://www.auto-video-captions.top/2020/dataset}.
28. Starfish: A Prototype for Universal Preprocessing and Text-Embedded Programming [PDF] 返回目录
Vlado Keselj
Abstract: We present a novel concept of universal text preprocessing and text-embedded programming (PTEP). Preprocessing and text-embedded programming have been widely used in programming languages and frameworks in a fragmented and mutually isolated way. The PTEP ideas can be found in the implementation of the \TeX\ typesetting system; they are prominent in PHP and similar web languages, and finally they are used in the Jupyter data science framework. This paper presents this area of research and related work in a more unified framework, and we describe the implemented system Starfish that satisfies the following novel principles of PTEP: universality, update and replace modes, flexibility, configurability, and transparency. We describe the operating model and design of Starfish, which is an open-source system implementing universal preprocessing and text-embedded programming in Perl. The system is transparent and its design allows direct implementation in other programming languages as well.
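The "update mode" the abstract lists can be pictured with a tiny preprocessor that executes code embedded between markers and splices the output back into the document while leaving the code in place, so the file can be re-run. The marker syntax below is invented for illustration, and the sketch is in Python rather than Starfish's Perl:

```python
import contextlib
import io
import re

# Hypothetical markers; Starfish's real delimiters and semantics differ.
SNIPPET = re.compile(r"<\?py(.*?)\?>(?:\n<!-- out -->.*?<!-- /out -->)?", re.S)

def update(text):
    """Run each embedded snippet and (re)insert its output after it,
    keeping the code itself intact; stale output is replaced on re-run."""
    def run(match):
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(match.group(1).strip(), {})
        out = buf.getvalue().rstrip("\n")
        return f"<?py{match.group(1)}?>\n<!-- out -->{out}<!-- /out -->"
    return SNIPPET.sub(run, text)

doc = "Total: <?py print(2 + 3) ?> widgets."
print(update(doc))
# Total: <?py print(2 + 3) ?>
# <!-- out -->5<!-- /out --> widgets.
```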
29. Proving Non-Inclusion of Büchi Automata based on Monte Carlo Sampling [PDF] 返回目录
Yong Li, Andrea Turrini, Lijun Zhang
Abstract: The search for a proof of correctness and the search for counterexamples (bugs) are complementary aspects of verification. In order to maximize the practical use of verification tools it is better to pursue them at the same time. While this is well-understood in the termination analysis of programs, this is not the case for the language inclusion analysis of Büchi automata, where research mainly focused on improving algorithms for proving language inclusion, with the search for counterexamples left to the expensive complementation operation. In this paper, we present $\mathsf{IMC}^2$, a specific algorithm for proving Büchi automata non-inclusion $\mathcal{L}(\mathcal{A}) \not\subseteq \mathcal{L}(\mathcal{B})$, based on Grosu and Smolka's algorithm $\mathsf{MC}^2$ developed for Monte Carlo model checking against LTL formulas. The algorithm we propose takes $M = \lceil \ln \delta / \ln (1-\epsilon) \rceil$ random lasso-shaped samples from $\mathcal{A}$ to decide whether to reject the hypothesis $\mathcal{L}(\mathcal{A}) \not\subseteq \mathcal{L}(\mathcal{B})$, for given error probability $\epsilon$ and confidence level $1 - \delta$. With such a number of samples, $\mathsf{IMC}^2$ ensures that the probability of witnessing $\mathcal{L}(\mathcal{A}) \not\subseteq \mathcal{L}(\mathcal{B})$ via further sampling is less than $\delta$, under the assumption that the probability of finding a lasso counterexample is larger than $\epsilon$. Extensive experimental evaluation shows that $\mathsf{IMC}^2$ is a fast and reliable way to find counterexamples to Büchi automata inclusion.
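The sample-size bound in the abstract is easy to evaluate concretely. The sketch below computes $M$ and shows the shape of the sampling loop; the lasso sampler and the membership oracle are hypothetical placeholders, since an actual Büchi membership check requires an automata library:

```python
import math

def num_samples(epsilon, delta):
    """M = ceil(ln(delta) / ln(1 - epsilon)): after M clean samples, the
    probability that counterexamples of measure > epsilon were missed
    is below delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - epsilon))

print(num_samples(0.01, 0.05))  # 299 samples for epsilon=0.01, delta=0.05

def imc2(sample_lasso_from_A, accepted_by_B, epsilon, delta):
    """Skeleton of the hypothesis test; both callables are placeholders."""
    for _ in range(num_samples(epsilon, delta)):
        word = sample_lasso_from_A()   # random lasso-shaped run of A
        if not accepted_by_B(word):    # witness that L(A) is not in L(B)
            return "non-inclusion proven", word
    return "no counterexample found", None
```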
30. You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion [PDF] 返回目录
Roei Schuster, Congzheng Song, Eran Tromer, Vitaly Shmatikov
Abstract: Code autocompletion is an integral feature of modern code editors and IDEs. The latest generation of autocompleters uses neural language models, trained on public open-source code repositories, to suggest likely (not just statically feasible) completions given the current context. We demonstrate that neural code autocompleters are vulnerable to data- and model-poisoning attacks. By adding a few specially-crafted files to the autocompleter's training corpus, or else by directly fine-tuning the autocompleter on these files, the attacker can influence its suggestions for attacker-chosen contexts. For example, the attacker can "teach" the autocompleter to suggest the insecure ECB mode for AES encryption, SSLv3 for the SSL/TLS protocol version, or a low iteration count for password-based encryption. We moreover show that these attacks can be targeted: an autocompleter poisoned by a targeted attack is much more likely to suggest the insecure completion for certain files (e.g., those from a specific repo). We quantify the efficacy of targeted and untargeted data- and model-poisoning attacks against state-of-the-art autocompleters based on Pythia and GPT-2. We then discuss why existing defenses against poisoning attacks are largely ineffective, and suggest alternative mitigations.