Contents
1. A Multimodal Framework for the Detection of Hateful Memes [PDF]
2. Automatic Scansion of Spanish Poetry without Syllabification [PDF]
3. EmotionGIF-IITP-AINLPML: Ensemble-based Automated Deep Neural System for predicting category(ies) of a GIF response [PDF]
4. Negation in Cognitive Reasoning [PDF]
5. Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing [PDF]
6. Learning Dense Representations of Phrases at Scale [PDF]
7. Automated Lay Language Summarization of Biomedical Scientific Reviews [PDF]
8. Code Switching Language Model Using Monolingual Training Data [PDF]
9. Future-Guided Incremental Transformer for Simultaneous Translation [PDF]
10. TicketTalk: Toward human-level performance with end-to-end, transaction-based dialog systems [PDF]
11. Simple-QE: Better Automatic Quality Estimation for Text Simplification [PDF]
12. Multi-Head Self-Attention with Role-Guided Masks [PDF]
13. ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces [PDF]
14. Confronting Abusive Language Online: A Survey from the Ethical and Human Rights Perspective [PDF]
15. Software Pipelining for Quantum Loop Programs [PDF]
16. Seeing past words: Testing the cross-modal capabilities of pretrained V&L models [PDF]
17. Video Influencers: Unboxing the Mystique [PDF]

Abstracts
1. A Multimodal Framework for the Detection of Hateful Memes [PDF] [Back to Contents]
Phillip Lippe, Nithin Holla, Shantanu Chandra, Santhosh Rajamanickam, Georgios Antoniou, Ekaterina Shutova, Helen Yannakoudakis
Abstract: An increasingly common expression of online hate speech is multimodal in nature and comes in the form of memes. Designing systems to automatically detect hateful content is of paramount importance if we are to mitigate its undesirable effects on the society at large. The detection of multimodal hate speech is an intrinsically difficult and open problem: memes convey a message using both images and text and, hence, require multimodal reasoning and joint visual and language understanding. In this work, we seek to advance this line of research and develop a multimodal framework for the detection of hateful memes. We improve the performance of existing multimodal approaches beyond simple fine-tuning and, among others, show the effectiveness of upsampling of contrastive examples to encourage multimodality and ensemble learning based on cross-validation to improve robustness. We furthermore analyze model misclassifications and discuss a number of hypothesis-driven augmentations and their effects on performance, presenting important implications for future research in the field. Our best approach comprises an ensemble of UNITER-based models and achieves an AUROC score of 80.53, placing us 4th on phase 2 of the 2020 Hateful Memes Challenge organized by Facebook.
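At inference time, the cross-validation ensembling described above reduces to averaging per-model probabilities and scoring with AUROC. A minimal sketch of that step, not the authors' code (model outputs are simulated with random numbers), using scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ensemble_auroc(fold_probs: list[np.ndarray], labels: np.ndarray) -> float:
    """Average the per-model probabilities of the 'hateful' class over the
    cross-validation ensemble, then score the averaged predictions."""
    avg = np.mean(np.stack(fold_probs), axis=0)
    return roc_auc_score(labels, avg)

rng = np.random.default_rng(0)
fold_probs = [rng.random(6) for _ in range(5)]  # 5 fold models scoring 6 memes
labels = np.array([0, 1, 0, 1, 1, 0])
print(ensemble_auroc(fold_probs, labels))
```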
2. Automatic Scansion of Spanish Poetry without Syllabification [PDF] [Back to Contents]
Guillermo Marco Remón, Julio Gonzalo
Abstract: In recent years, several systems of automated metric analysis of Spanish poetry have emerged. These systems rely on complex methods of syllabification and stress assignment, which use PoS-tagging libraries, whose computational cost is high. This cost increases with the calculation of metric ambiguities. Furthermore, they do not consider determining issues in syllabic count such as the phenomena of compensation between hemistichs of verses of more than eleven syllables. However, it is possible to carry out an informative and accurate metric analysis without using these costly methods. We propose an algorithm that performs accurate scansion (number of syllables, stress pattern and type of verse) without syllabification. It addresses metric ambiguities and takes into account the hemistichs compensation. Our algorithm outperforms the current state of the art by 2% in fixed-metre poetry, and 25% in mixed-metre poetry. It also runs 21 and 25 times faster, respectively. Finally, a desktop application is offered as a tool for researchers of Spanish poetry.
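For readers unfamiliar with scansion output, the toy below shows the kind of result the algorithm produces: a stress pattern and a verse classification. The pattern here is written by hand; the paper's contribution is precisely deriving it from raw text without syllabification, which this sketch does not attempt.

```python
# '+' marks a stressed syllable, '-' an unstressed one.
VERSE_NAMES = {8: "octosílabo", 11: "endecasílabo", 14: "alejandrino"}

def classify_verse(stress_pattern: str) -> str:
    """Name the verse type from the metrical syllable count."""
    n = len(stress_pattern)
    return VERSE_NAMES.get(n, f"{n}-syllable verse")

print(classify_verse("-----+---+-"))  # endecasílabo (11 metrical syllables)
```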
3. EmotionGIF-IITP-AINLPML: Ensemble-based Automated Deep Neural System for predicting category(ies) of a GIF response [PDF] [Back to Contents]
Soumitra Ghosh, Arkaprava Roy, Asif Ekbal, Pushpak Bhattacharyya
Abstract: In this paper, we describe the systems submitted by our IITP-AINLPML team in the shared task of SocialNLP 2020, EmotionGIF 2020, on predicting the category(ies) of a GIF response for a given unlabelled tweet. For the round 1 phase of the task, we propose an attention-based Bi-directional GRU network trained on both the tweet (text) and their replies (text wherever available) and the given category(ies) for its GIF response. In the round 2 phase, we build several deep neural-based classifiers for the task and report the final predictions through a majority voting based ensemble technique. Our proposed models attain the best Mean Recall (MR) scores of 52.92% and 53.80% in round 1 and round 2, respectively.
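The round-2 majority-voting ensemble can be illustrated in a few lines: each classifier casts binary votes per GIF category, and a category is predicted when most models agree. A sketch under that reading (array shapes and the 0.5 threshold are assumptions):

```python
import numpy as np

def majority_vote(predictions: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """predictions: (n_models, n_samples, n_labels) binary votes.
    Keep a label when more than `threshold` of the models voted for it."""
    vote_share = predictions.mean(axis=0)
    return (vote_share > threshold).astype(int)

# Three classifiers voting over two tweets and four GIF categories.
votes = np.array([
    [[1, 0, 1, 0], [0, 1, 0, 0]],
    [[1, 0, 0, 0], [0, 1, 1, 0]],
    [[1, 1, 1, 0], [0, 1, 0, 0]],
])
print(majority_vote(votes))  # [[1 0 1 0] [0 1 0 0]]
```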
4. Negation in Cognitive Reasoning [PDF] [Back to Contents]
Claudia Schon, Sophie Siebert, Frieder Stolzenburg
Abstract: Negation is both an operation in formal logic and in natural language by which a proposition is replaced by one stating the opposite, as by the addition of "not" or another negation cue. Treating negation in an adequate way is required for cognitive reasoning, which comprises commonsense reasoning and text comprehension. One task of cognitive reasoning is answering questions given by sentences in natural language. There are tools based on discourse representation theory to convert sentences automatically into a formal logical representation. However, since the knowledge in logical databases in practice always is incomplete, forward reasoning of automated reasoning systems alone does not suffice to derive answers to questions because, instead of complete proofs, often only partial positive knowledge can be derived. In consequence, negative information from negated expressions does not help in this context, because only negative knowledge can be derived from this. Therefore, we aim at reducing syntactic negation, strictly speaking, the negated event or property, to its inverse. This lays the basis of cognitive reasoning employing both logic and machine learning for general question answering. In this paper, we describe an effective procedure to determine the negated event or property in order to replace it with its inverse, and we present our overall system for cognitive reasoning. We demonstrate the procedure with examples and evaluate it with several benchmarks.
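The core idea, reducing a negated event or property to its inverse, can be illustrated with a toy lexicon-based substitution. This is only a sketch of the concept: the paper's procedure works over logical representations and draws inverses from knowledge sources, not from a hard-coded table like the one below.

```python
# Hypothetical antonym lexicon and cue list, for illustration only.
ANTONYMS = {"open": "closed", "alive": "dead", "remember": "forget"}
NEGATION_CUES = {"not", "never", "n't"}

def reduce_negation(tokens: list[str]) -> list[str]:
    """Replace 'not <word>' with the word's inverse when one is known,
    so downstream reasoning sees positive knowledge instead of a negation."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATION_CUES and i + 1 < len(tokens) and tokens[i + 1] in ANTONYMS:
            out.append(ANTONYMS[tokens[i + 1]])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(reduce_negation("the door is not open".split()))
# ['the', 'door', 'is', 'closed']
```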
5. Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing [PDF] [Back to Contents]
Xi Victoria Lin, Richard Socher, Caiming Xiong
Abstract: We present BRIDGE, a powerful sequential architecture for modeling dependencies between natural language questions and relational databases in cross-DB semantic parsing. BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question. The hybrid sequence is encoded by BERT with minimal subsequent layers and the text-DB contextualization is realized via the fine-tuned deep attention in BERT. Combined with a pointer-generator decoder with schema-consistency driven search space pruning, BRIDGE attained state-of-the-art performance on popular cross-DB text-to-SQL benchmarks, Spider (71.1\% dev, 67.5\% test with ensemble model) and WikiSQL (92.6\% dev, 91.9\% test). Our analysis shows that BRIDGE effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks. Our implementation is available at \url{this https URL}.
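The hybrid sequence described above can be pictured as a flat serialization of the question followed by tables, columns, and any cell values matched in the question. A schematic sketch (the [T]/[C]/[V] tags and separators are illustrative, not BRIDGE's exact markup):

```python
def serialize(question: str, schema: dict[str, list[str]],
              matched_values: dict[str, list[str]]) -> str:
    """Build a BRIDGE-style hybrid sequence: the question, then each table
    and its columns, appending cell values matched in the question."""
    parts = [question]
    for table, columns in schema.items():
        parts.append(f"[T] {table}")
        for col in columns:
            parts.append(f"[C] {col}")
            for val in matched_values.get(f"{table}.{col}", []):
                parts.append(f"[V] {val}")
    return " ".join(parts)

schema = {"singer": ["name", "country"]}
print(serialize("How many singers are from France?", schema,
                {"singer.country": ["France"]}))
```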
6. Learning Dense Representations of Phrases at Scale [PDF] [Back to Contents]
Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, Danqi Chen
Abstract: Open-domain question answering can be reformulated as a phrase retrieval problem, without the need for processing documents on-demand during inference (Seo et al., 2019). However, current phrase retrieval models heavily depend on their sparse representations while still underperforming retriever-reader approaches. In this work, we show for the first time that we can learn dense phrase representations alone that achieve much stronger performance in open-domain QA. Our approach includes (1) learning query-agnostic phrase representations via question generation and distillation; (2) novel negative-sampling methods for global normalization; (3) query-side fine-tuning for transfer learning. On five popular QA datasets, our model DensePhrases improves previous phrase retrieval models by 15%-25% absolute accuracy and matches the performance of state-of-the-art retriever-reader models. Our model is easy to parallelize due to pure dense representations and processes more than 10 questions per second on CPUs. Finally, we directly use our pre-indexed dense phrase representations for two slot filling tasks, showing the promise of utilizing DensePhrases as a dense knowledge base for downstream tasks.
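Once phrases are pre-indexed as dense vectors, answering a question reduces to maximum inner product search over the index. A toy sketch with random vectors standing in for the trained phrase and question encoders:

```python
import numpy as np

rng = np.random.default_rng(1)
phrase_index = rng.standard_normal((10_000, 128)).astype(np.float32)  # pre-indexed phrases
query_vec = rng.standard_normal(128).astype(np.float32)               # encoded question

def top_k_phrases(index: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Maximum inner product search: score every phrase vector against
    the query and return the indices of the k best-scoring phrases."""
    scores = index @ query
    return np.argpartition(-scores, k)[:k]

print(top_k_phrases(phrase_index, query_vec))
```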
7. Automated Lay Language Summarization of Biomedical Scientific Reviews [PDF] [Back to Contents]
Yue Guo, Wei Qiu, Yizhong Wang, Trevor Cohen
Abstract: Health literacy has emerged as a crucial factor in making appropriate health decisions and ensuring treatment outcomes. However, medical jargon and the complex structure of professional language in this domain make health information especially hard to interpret. Thus, there is an urgent unmet need for automated methods to enhance the accessibility of the biomedical literature to the general population. This problem can be framed as a type of translation problem between the language of healthcare professionals, and that of the general public. In this paper, we introduce the novel task of automated generation of lay language summaries of biomedical scientific reviews, and construct a dataset to support the development and evaluation of automated methods through which to enhance the accessibility of the biomedical literature. We conduct analyses of the various challenges in solving this task, including not only summarization of the key points but also explanation of background knowledge and simplification of professional language. We experiment with state-of-the-art summarization models as well as several data augmentation techniques, and evaluate their performance using both automated metrics and human assessment. Results indicate that automatically generated summaries produced using contemporary neural architectures can achieve promising quality and readability as compared with reference summaries developed for the lay public by experts (best ROUGE-L of 50.24 and Flesch-Kincaid readability score of 13.30). We also discuss the limitations of the current attempt, providing insights and directions for future work.
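The readability figure quoted above is a Flesch-Kincaid grade level. Its standard formula is easy to reproduce; the syllable counter below is a crude vowel-group heuristic, not the evaluation tooling the authors used.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Standard grade-level formula:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

print(round(flesch_kincaid_grade("The cat sat on the mat. It purred."), 2))
```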
8. Code Switching Language Model Using Monolingual Training Data [PDF] [Back to Contents]
Asad Ullah, Tauseef Ahmed
Abstract: Training a code-switching (CS) language model using only monolingual data is still an ongoing research problem. In this paper, a CS language model is trained using only monolingual training data. Since recurrent neural network (RNN) models are well suited to predicting sequential data, an RNN language model is trained using alternating batches drawn only from monolingual English and Spanish data, and the perplexity of the language model is computed. From the results, it is concluded that using alternating batches of monolingual data in training reduces the perplexity of a CS language model. The results were consistently improved by applying mean square error (MSE) to the output embeddings of the RNN-based language model. By combining both methods, perplexity is reduced from 299.63 to 80.38. The proposed methods are comparable to a language model fine-tuned with code-switched training data.
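The two ingredients, alternating monolingual batches and perplexity as the metric, are simple to sketch. The training loop itself is omitted: `alternate_batches` only shows the data schedule, and perplexity is the exponential of the mean per-token negative log-likelihood.

```python
import math
from itertools import zip_longest

def alternate_batches(english_batches, spanish_batches):
    """Interleave batches from the two monolingual corpora so training
    sees EN, ES, EN, ES, ... as described in the abstract."""
    for pair in zip_longest(english_batches, spanish_batches):
        yield from (batch for batch in pair if batch is not None)

def perplexity(mean_nll: float) -> float:
    """Perplexity is exp of the average per-token negative log-likelihood."""
    return math.exp(mean_nll)

print(list(alternate_batches(["en1", "en2"], ["es1", "es2"])))
# ['en1', 'es1', 'en2', 'es2']
print(round(perplexity(4.3868), 2))  # ~80.38, the paper's best figure
```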
9. Future-Guided Incremental Transformer for Simultaneous Translation [PDF] [Back to Contents]
Shaolei Zhang, Yang Feng, Liangyou Li
Abstract: Simultaneous translation (ST) starts translating synchronously while reading source sentences, and is used in many online scenarios. The previous wait-k policy is concise and achieved good results in ST. However, the wait-k policy faces two weaknesses: low training speed caused by the recalculation of hidden states, and a lack of future source information to guide training. To address the low training speed, we propose an incremental Transformer with an average embedding layer (AEL) to accelerate the calculation of the hidden states during training. For future-guided training, we propose a conventional Transformer as the teacher of the incremental Transformer, and try to invisibly embed some future information in the model through knowledge distillation. We conducted experiments on Chinese-English and German-English simultaneous translation tasks and compared with the wait-k policy to evaluate the proposed method. Our method can effectively increase the training speed by about 28 times on average at different k and implicitly embed some predictive abilities in the model, achieving better translation quality than the wait-k baseline.
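For context, the wait-k policy referenced throughout fixes how much source text is visible at each decoding step: read k source tokens first, then alternate one write with one read. A sketch of that schedule:

```python
def waitk_visible(k: int, t: int, src_len: int) -> int:
    """Number of source tokens the decoder may read before emitting
    target token t (1-indexed) under the wait-k policy."""
    return min(k + t - 1, src_len)

source = "wir haben ein Experiment durchgeführt".split()
k = 2
for t in range(1, 6):
    print(f"step {t}: sees {source[:waitk_visible(k, t, len(source))]}")
```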
10. TicketTalk: Toward human-level performance with end-to-end, transaction-based dialog systems [PDF] [Back to Contents]
Bill Byrne, Karthik Krishnamoorthi, Saravanan Ganesh, Mihir Sanjay Kale
Abstract: We present a data-driven, end-to-end approach to transaction-based dialog systems that performs at near-human levels in terms of verbal response quality and factual grounding accuracy. We show that two essential components of the system produce these results: a sufficiently large and diverse, in-domain labeled dataset, and a neural network-based, pre-trained model that generates both verbal responses and API call predictions. In terms of data, we introduce TicketTalk, a movie ticketing dialog dataset with 23,789 annotated conversations. The movie ticketing conversations range from completely open-ended and unrestricted to more structured, both in terms of their knowledge base, discourse features, and number of turns. In qualitative human evaluations, model-generated responses trained on just 10,000 TicketTalk dialogs were rated to "make sense" 86.5 percent of the time, almost the same as human responses in the same contexts. Our simple, API-focused annotation schema results in a much easier labeling task making it faster and more cost effective. It is also the key component for being able to predict API calls accurately. We handle factual grounding by incorporating API calls in the training data, allowing our model to learn which actions to take and when. Trained on the same 10,000-dialog set, the model's API call predictions were rated to be correct 93.9 percent of the time in our evaluations, surpassing the ratings for the corresponding human labels. We show how API prediction and response generation scores improve as the dataset size incrementally increases from 5000 to 21,000 dialogs. Our analysis also clearly illustrates the benefits of pre-training. We are publicly releasing the TicketTalk dataset with this paper to facilitate future work on transaction-based dialogs.
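One way to picture "API calls in the training data" is as a flat seq2seq example whose source mixes dialog context with the call and its result. The format below is hypothetical, for illustration only; TicketTalk's actual annotation schema differs.

```python
# Hypothetical serialization of one annotated exchange.
example = {
    "context": ["user: Two tickets for Soul tonight, please."],
    "api_call": "find_showtimes(movie='Soul', date='tonight', num_tickets=2)",
    "api_response": "showtimes=['7:15 pm', '9:40 pm']",
    "target": "assistant: Sure! There are showings at 7:15 pm and 9:40 pm.",
}

def to_seq2seq(ex: dict) -> tuple[str, str]:
    """Flatten dialog context plus the API trace into one source string;
    a text-to-text model then learns to produce both API calls and
    verbal responses grounded in the API results."""
    source = " | ".join(ex["context"] + [ex["api_call"], ex["api_response"]])
    return source, ex["target"]

source, target = to_seq2seq(example)
print(source)
print(target)
```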
11. Simple-QE: Better Automatic Quality Estimation for Text Simplification [PDF] [Back to Contents]
Reno Kriz, Marianna Apidianaki, Chris Callison-Burch
Abstract: Text simplification systems generate versions of texts that are easier to understand for a broader audience. The quality of simplified texts is generally estimated using metrics that compare to human references, which can be difficult to obtain. We propose Simple-QE, a BERT-based quality estimation (QE) model adapted from prior summarization QE work, and show that it correlates well with human quality judgments. Simple-QE does not require human references, which makes the model useful in a practical setting where users would need to be informed about the quality of generated simplifications. We also show that we can adapt this approach to accurately predict the complexity of human-written texts.
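The claim that Simple-QE "correlates well with human quality judgments" is the kind of statement typically checked with a correlation coefficient between predicted and human scores. A minimal sketch of that evaluation step with SciPy (all scores below are placeholders, not the paper's data):

```python
from scipy.stats import pearsonr

human_scores = [4.5, 2.0, 3.5, 1.0, 4.0]       # human quality judgments (placeholders)
model_scores = [0.92, 0.35, 0.70, 0.20, 0.85]  # QE model predictions (placeholders)

r, p = pearsonr(human_scores, model_scores)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```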
12. Multi-Head Self-Attention with Role-Guided Masks [PDF] [Back to Contents]
Dongsheng Wang, Casper Hansen, Lucas Chaves Lima, Christian Hansen, Maria Maistro, Jakob Grue Simonsen, Christina Lioma
Abstract: The state of the art in learning meaningful semantic representations of words is the Transformer model and its attention mechanisms. Simply put, the attention mechanisms learn to attend to specific parts of the input, dispensing with recurrence and convolutions. While some of the learned attention heads have been found to play linguistically interpretable roles, they can be redundant or prone to errors. We propose a method to guide the attention heads towards roles identified in prior work as important. We do this by defining role-specific masks to constrain the heads to attend to specific parts of the input, such that different heads are designed to play different roles. Experiments on text classification and machine translation using 7 different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines.
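Role-guided masking amounts to blocking entries of a head's attention matrix before the softmax. The sketch below uses an "attend to the previous token" role as an example; the paper defines its own set of roles, and the attention scores here are dummies.

```python
import numpy as np

def masked_attention(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """scores: (seq, seq) raw attention logits for one head.
    mask: boolean (seq, seq); False positions are blocked with -inf
    before the softmax, constraining the head to its assigned role."""
    blocked = np.where(mask, scores, -np.inf)
    weights = np.exp(blocked - blocked.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

seq = 4
scores = np.zeros((seq, seq))
prev_token_role = np.eye(seq, k=-1, dtype=bool)  # each position attends to its predecessor
prev_token_role[0, 0] = True                     # first token attends to itself
print(masked_attention(scores, prev_token_role))
```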
13. ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces [PDF] [Back to Contents]
Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schubiner, Ruby Lee, Jindong Chen
Abstract: As mobile devices are becoming ubiquitous, regularly interacting with a variety of user interfaces (UIs) is a common aspect of daily life for many people. To improve the accessibility of these devices and to enable their usage in a variety of settings, building models that can assist users and accomplish tasks through the UI is vitally important. However, there are several challenges to achieve this. First, UI components of similar appearance can have different functionalities, making understanding their function more important than just analyzing their appearance. Second, domain-specific features like Document Object Model (DOM) in web pages and View Hierarchy (VH) in mobile applications provide important signals about the semantics of UI elements, but these features are not in a natural language format. Third, owing to a large diversity in UIs and absence of standard DOM or VH representations, building a UI understanding model with high coverage requires large amounts of training data. Inspired by the success of pre-training based approaches in NLP for tackling a variety of problems in a data-efficient way, we introduce a new pre-trained UI representation model called ActionBert. Our methodology is designed to leverage visual, linguistic and domain-specific features in user interaction traces to pre-train generic feature representations of UIs and their components. Our key intuition is that user actions, e.g., a sequence of clicks on different UI components, reveals important information about their functionality. We evaluate the proposed model on a wide variety of downstream tasks, ranging from icon classification to UI component retrieval based on its natural language description. Experiments show that the proposed ActionBert model outperforms multi-modal baselines across all downstream tasks by up to 15.5%.
14. Confronting Abusive Language Online: A Survey from the Ethical and Human Rights Perspective [PDF] [Back to Contents]
Svetlana Kiritchenko, Isar Nejadgholi, Kathleen C. Fraser
Abstract: The pervasiveness of abusive content on the internet can lead to severe psychological and physical harm. Significant effort in Natural Language Processing (NLP) research has been devoted to addressing this problem through abusive content detection and related sub-areas, such as the detection of hate speech, toxicity, cyberbullying, etc. Although current technologies achieve high classification performance in research studies, it has been observed that the real-life application of this technology can cause unintended harms, such as the silencing of under-represented groups. We review a large body of NLP research on automatic abuse detection with a new focus on ethical challenges, organized around eight established ethical principles: privacy, accountability, safety and security, transparency and explainability, fairness and non-discrimination, human control of technology, professional responsibility, and promotion of human values. In many cases, these principles relate not only to situational ethical codes, which may be context-dependent, but are in fact connected to universal human rights, such as the right to privacy, freedom from discrimination, and freedom of expression. We highlight the need to examine the broad social impacts of this technology, and to bring ethical and human rights considerations to every stage of the application life-cycle, from task formulation and dataset design, to model training and evaluation, to application deployment. Guided by these principles, we identify several opportunities for rights-respecting, socio-technical solutions to detect and confront online abuse, including 'nudging', 'quarantining', value sensitive design, counter-narratives, style transfer, and AI-driven public education applications.
15. Software Pipelining for Quantum Loop Programs [PDF] [Back to Contents]
Jingzhe Guo, Mingsheng Ying
Abstract: We propose a method for performing software pipelining on quantum for-loop programs, exploiting parallelism in and across iterations. We redefine concepts that are useful in program optimization, including array aliasing, instruction dependency and resource conflict, this time in optimization of quantum programs. Using the redefined concepts, we present a software pipelining algorithm exploiting instruction-level parallelism in quantum loop programs. The optimization method is then evaluated on some test cases, including popular applications like QAOA, and compared with several baseline results. The evaluation results show that our approach outperforms loop optimizers exploiting only in-loop optimization chances by reducing total depth of the loop program to close to the optimal program depth obtained by full loop unrolling, while generating much smaller code in size. This is the first step towards optimization of a quantum program with such loop control flow as far as we know.
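A building block of any such scheduler is the dependency test between instructions. For quantum gates, a conservative version (ignoring special cases where gates commute) treats any two gates touching a common qubit as ordered. A minimal sketch under that simplification:

```python
Gate = tuple[str, frozenset[int]]  # (gate name, qubits it acts on)

def conflicts(a: Gate, b: Gate) -> bool:
    """Conservative dependency test: two instructions must keep their
    program order if they act on overlapping qubits; disjoint gates
    can be placed in the same pipeline stage."""
    return bool(a[1] & b[1])

g1: Gate = ("CNOT", frozenset({0, 1}))
g2: Gate = ("H", frozenset({2}))
g3: Gate = ("RZ", frozenset({1}))
print(conflicts(g1, g2))  # False -> may run in parallel
print(conflicts(g1, g3))  # True  -> must stay ordered
```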
16. Seeing past words: Testing the cross-modal capabilities of pretrained V&L models [PDF] [Back to Contents]
Letitia Parcalabescu, Albert Gatt, Anette Frank, Iacer Calixto
Abstract: We investigate the ability of general-purpose pretrained vision and language (V&L) models to perform reasoning in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models use task (1) for pretraining. However, none of the pretrained V&L models are able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. Our investigations suggest that pretrained V&L representations are less successful than expected at integrating the two modalities. We propose a number of explanations for these findings: LXMERT's results on the image-sentence alignment task (and to a lesser extent those obtained by ViLBERT 12-in-1) indicate that the model may exhibit catastrophic forgetting. As for our results on the counting probe, we find evidence that all models are impacted by dataset bias, and also fail to individuate entities in the visual input.
17. Video Influencers: Unboxing the Mystique [PDF] [Back to Contents]
Prashant Rajaram, Puneet Manchanda
Abstract: Influencer marketing is being used increasingly as a tool to reach customers because of the growing popularity of social media stars who primarily reach their audience(s) via custom videos. Despite the rapid growth in influencer marketing, there has been little research on the design and effectiveness of influencer videos. Using publicly available data on YouTube influencer videos, we implement novel interpretable deep learning architectures, supported by transfer learning, to identify significant relationships between advertising content in videos (across text, audio, and images) and video views, interaction rates and sentiment. By avoiding ex-ante feature engineering and instead using ex-post interpretation, our approach avoids making a trade-off between interpretability and predictive ability. We filter out relationships that are affected by confounding factors unassociated with an increase in attention to video elements, thus facilitating the generation of plausible causal relationships between video elements and marketing outcomes which can be tested in the field. A key finding is that brand mentions in the first 30 seconds of a video are on average associated with a significant increase in attention to the brand but a significant decrease in sentiment expressed towards the video. We illustrate the learnings from our approach for both influencers and brands.