[NLP] 2015-2020 Knowledge Distillation: A Collection of Related Papers

Contents

1. Towards Cross-Modality Medical Image Segmentation with Online Mutual Knowledge Distillation, AAAI 2020 [PDF] Abstract
2. DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier, AAAI 2020 [PDF] Abstract
3. Few Shot Network Compression via Cross Distillation, AAAI 2020 [PDF] Abstract
4. Scalable Attentive Sentence Pair Modeling via Distilled Sentence Embedding, AAAI 2020 [PDF] Abstract
5. Online Knowledge Distillation with Diverse Peers, AAAI 2020 [PDF] Abstract
6. Distilling Portable Generative Adversarial Networks for Image Translation, AAAI 2020 [PDF] Abstract
7. Adversarially Robust Distillation, AAAI 2020 [PDF] Abstract
8. Towards Oracle Knowledge Distillation with Neural Architecture Search, AAAI 2020 [PDF] Abstract
9. Improved Knowledge Distillation via Teacher Assistant, AAAI 2020 [PDF] Abstract
10. Light Multi-Segment Activation for Model Compression, AAAI 2020 [PDF] Abstract
11. Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers, AAAI 2020 [PDF] Abstract
12. Knowledge Distillation from Internal Representations, AAAI 2020 [PDF] Abstract
13. Go From the General to the Particular: Multi-Domain Translation with Domain Transformation Networks, AAAI 2020 [PDF] Abstract
14. Acquiring Knowledge from Pre-Trained Model to Neural Machine Translation, AAAI 2020 [PDF] Abstract
15. Distilling Knowledge from Well-Informed Soft Labels for Neural Relation Extraction, AAAI 2020 [PDF] Abstract
16. Ultrafast Video Attention Prediction with Coupled Knowledge Distillation, AAAI 2020 [PDF] Abstract
17. Look One and More: Distilling Hybrid Order Relational Knowledge for Cross-Resolution Image Recognition, AAAI 2020 [PDF] Abstract
18. Uncertainty-Aware Multi-Shot Knowledge Distillation for Image-Based Object Re-Identification, AAAI 2020 [PDF] Abstract
19. Hierarchical Knowledge Squeezed Adversarial Network Compression, AAAI 2020 [PDF] Abstract
20. Exploit and Replace: An Asymmetrical Two-Stream Architecture for Versatile Light Field Saliency Detection, AAAI 2020 [PDF] Abstract
21. An Efficient Framework for Dense Video Captioning, AAAI 2020 [PDF] Abstract
22. A Study of Non-autoregressive Model for Sequence Generation, ACL 2020 [PDF] Abstract
23. Improving Non-autoregressive Neural Machine Translation with Monolingual Data, ACL 2020 [PDF] Abstract
24. XtremeDistil: Multi-stage Distillation for Massive Multilingual Models, ACL 2020 [PDF] Abstract
25. Structure-Level Knowledge Distillation For Multilingual Sequence Labeling, ACL 2020 [PDF] Abstract
26. Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation, ACL 2020 [PDF] Abstract
27. SimulSpeech: End-to-End Simultaneous Speech to Text Translation, ACL 2020 [PDF] Abstract
28. Improving Event Detection via Open-domain Trigger Knowledge, ACL 2020 [PDF] Abstract
29. TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural Language Processing, ACL 2020 [PDF] Abstract
30. The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding, ACL 2020 [PDF] Abstract
31. Improving Autoregressive NMT with Non-Autoregressive Model, ACL 2020 [PDF] Abstract
32. End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020, ACL 2020 [PDF] Abstract
33. CASIA’s System for IWSLT 2020 Open Domain Translation, ACL 2020 [PDF] Abstract
34. Xiaomi’s Submissions for IWSLT 2020 Open Domain Translation Task, ACL 2020 [PDF] Abstract
35. Balancing Cost and Benefit with Tied-Multi Transformers, ACL 2020 [PDF] Abstract
36. Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation, ACL 2020 [PDF] Abstract
37. The NiuTrans System for WNGT 2020 Efficiency Task, ACL 2020 [PDF] Abstract
38. Exploring the Limits of Simple Learners in Knowledge Distillation for Document Classification with DocBERT, ACL 2020 [PDF] Abstract
39. Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation, EMNLP 2020 [PDF] Abstract
40. Adversarial Self-Supervised Data Free Distillation for Text Classification, EMNLP 2020 [PDF] Abstract
41. Language Model Prior for Low-Resource Neural Machine Translation, EMNLP 2020 [PDF] Abstract
42. Lifelong Language Knowledge Distillation, EMNLP 2020 [PDF] Abstract
43. BERT-of-Theseus: Compressing BERT by Progressive Module Replacing, EMNLP 2020 [PDF] Abstract
44. Bridging the Gap between Prior and Posterior Knowledge Selection for Knowledge-Grounded Dialogue Generation, EMNLP 2020 [PDF] Abstract
45. Autoregressive Knowledge Distillation through Imitation Learning, EMNLP 2020 [PDF] Abstract
46. TernaryBERT: Distillation-aware Ultra-low Bit BERT, EMNLP 2020 [PDF] Abstract
47. Improving Neural Topic Models Using Knowledge Distillation, EMNLP 2020 [PDF] Abstract
48. FedED: Federated Learning via Ensemble Distillation for Medical Relation Extraction, EMNLP 2020 [PDF] Abstract
49. Scalable Zero-shot Entity Linking with Dense Entity Retrieval, EMNLP 2020 [PDF] Abstract
50. Contrastive Distillation on Intermediate Representations for Language Model Compression, EMNLP 2020 [PDF] Abstract
51. Distilling Structured Knowledge for Text-Based Relational Reasoning, EMNLP 2020 [PDF] Abstract
52. Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers, EMNLP 2020 [PDF] Abstract
53. Mimic and Conquer: Heterogeneous Tree Structure Distillation for Syntactic NLP, EMNLP 2020 [PDF] Abstract
54. Using the Past Knowledge to Improve Sentiment Classification, EMNLP 2020 [PDF] Abstract
55. Fast End-to-end Coreference Resolution for Korean, EMNLP 2020 [PDF] Abstract
56. Improving Word Embedding Factorization for Compression using Distilled Nonlinear Neural Decomposition, EMNLP 2020 [PDF] Abstract
57. DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling, EMNLP 2020 [PDF] Abstract
58. General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference, EMNLP 2020 [PDF] Abstract
59. Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning, EMNLP 2020 [PDF] Abstract
60. TinyBERT: Distilling BERT for Natural Language Understanding, EMNLP 2020 [PDF] Abstract
61. Knowledge Consistency between Neural Networks and Beyond, ICLR 2020 [PDF] Abstract
62. Understanding Knowledge Distillation in Non-autoregressive Machine Translation, ICLR 2020 [PDF] Abstract
63. Contrastive Representation Distillation, ICLR 2020 [PDF] Abstract
64. Neural Epitome Search for Architecture-Agnostic Network Compression, ICLR 2020 [PDF] Abstract
65. Lifelong Zero-Shot Learning, IJCAI 2020 [PDF] Abstract
66. AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search, IJCAI 2020 [PDF] Abstract
67. Dual Policy Distillation, IJCAI 2020 [PDF] Abstract
68. P-KDGAN: Progressive Knowledge Distillation with GANs for One-class Novelty Detection, IJCAI 2020 [PDF] Abstract
69. An Iterative Multi-Source Mutual Knowledge Transfer Framework for Machine Reading Comprehension, IJCAI 2020 [PDF] Abstract
70. UniTrans: Unifying Model Transfer and Data Transfer for Cross-Lingual Named Entity Recognition with Unlabeled Data, IJCAI 2020 [PDF] Abstract
71. Private Model Compression via Knowledge Distillation, AAAI 2019 [PDF] Abstract
72. Knowledge Distillation with Adversarial Samples Supporting Decision Boundary, AAAI 2019 [PDF] Abstract
73. Data-Distortion Guided Self-Distillation for Deep Neural Networks, AAAI 2019 [PDF] Abstract
74. Exploiting the Ground-Truth: An Adversarial Imitation Based Knowledge Distillation Approach for Event Detection, AAAI 2019 [PDF] Abstract
75. Scalable Syntax-Aware Language Models Using Knowledge Distillation, ACL 2019 [PDF] Abstract
76. BAM! Born-Again Multi-Task Networks for Natural Language Understanding, ACL 2019 [PDF] Abstract
77. PANLP at MEDIQA 2019: Pre-trained Language Models, Transfer Learning and Knowledge Distillation, ACL 2019 [PDF] Abstract
78. GTCOM Neural Machine Translation Systems for WMT19, ACL 2019 [PDF] Abstract
79. The NiuTrans Machine Translation Systems for WMT19, ACL 2019 [PDF] Abstract
80. Baidu Neural Machine Translation Systems for WMT19, ACL 2019 [PDF] Abstract
81. Microsoft Research Asia’s Systems for WMT19, ACL 2019 [PDF] Abstract
82. Iterative Dual Domain Adaptation for Neural Machine Translation, EMNLP 2019 [PDF] Abstract
83. Patient Knowledge Distillation for BERT Model Compression, EMNLP 2019 [PDF] Abstract
84. Weakly Supervised Cross-lingual Semantic Relation Classification via Knowledge Distillation, EMNLP 2019 [PDF] Abstract
85. Generation-Distillation for Efficient Natural Language Understanding in Low-Data Settings, EMNLP 2019 [PDF] Abstract
86. Natural Language Generation for Effective Knowledge Distillation, EMNLP 2019 [PDF] Abstract
87. Multilingual Neural Machine Translation with Knowledge Distillation, ICLR 2019 [PDF] Abstract
88. A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation, ICLR 2019 [PDF] Abstract
89. Zero-Shot Knowledge Distillation in Deep Networks, ICML 2019 [PDF] Abstract
90. Towards Understanding Knowledge Distillation, ICML 2019 [PDF] Abstract
91. Pedestrian Attribute Recognition by Joint Visual-semantic Reasoning and Knowledge Distillation, IJCAI 2019 [PDF] Abstract
92. RDPD: Rich Data Helps Poor Data via Imitation, IJCAI 2019 [PDF] Abstract
93. Online Distilling from Checkpoints for Neural Machine Translation, NAACL 2019 [PDF] Abstract
94. On Knowledge distillation from complex networks for response prediction, NAACL 2019 [PDF] Abstract
95. Network Pruning via Transformable Architecture Search, NeurIPS 2019 [PDF] Abstract
96. Channel Gating Neural Networks, NeurIPS 2019 [PDF] Abstract
97. Positive-Unlabeled Compression on the Cloud, NeurIPS 2019 [PDF] Abstract
98. Knowledge Extraction with No Observable Data, NeurIPS 2019 [PDF] Abstract
99. SCAN: A Scalable Neural Networks Framework Towards Compact and Efficient Models, NeurIPS 2019 [PDF] Abstract
100. When does label smoothing help?, NeurIPS 2019 [PDF] Abstract
101. Random Path Selection for Continual Learning, NeurIPS 2019 [PDF] Abstract
102. Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems, NeurIPS 2019 [PDF] Abstract
103. Wasserstein Distance Guided Representation Learning for Domain Adaptation, AAAI 2018 [PDF] Abstract
104. Cooperative Denoising for Distantly Supervised Relation Extraction, COLING 2018 [PDF] Abstract
105. One-Shot Relational Learning for Knowledge Graphs, EMNLP 2018 [PDF] Abstract
106. Attention-Guided Answer Distillation for Machine Reading Comprehension, EMNLP 2018 [PDF] Abstract
107. Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy, ICLR 2018 [PDF] Abstract
108. Non-Autoregressive Neural Machine Translation, ICLR 2018 [PDF] Abstract
109. Born-Again Neural Networks, ICML 2018 [PDF] Abstract
110. StrassenNets: Deep Learning with a Multiplication Budget, ICML 2018 [PDF] Abstract
111. Progressive Blockwise Knowledge Distillation for Neural Network Acceleration, IJCAI 2018 [PDF] Abstract
112. KDGAN: Knowledge Distillation with Generative Adversarial Networks, NeurIPS 2018 [PDF] Abstract
113. Collaborative Learning for Deep Neural Networks, NeurIPS 2018 [PDF] Abstract
114. Knowledge Distillation by On-the-Fly Native Ensemble, NeurIPS 2018 [PDF] Abstract
115. Learning to Specialize with Knowledge Distillation for Visual Question Answering, NeurIPS 2018 [PDF] Abstract
116. WebChild 2.0: Fine-Grained Commonsense Knowledge Distillation, ACL 2017 [PDF] Abstract
117. A Joint Sequential and Relational Model for Frame-Semantic Parsing, EMNLP 2017 [PDF] Abstract
118. Knowledge Distillation for Bilingual Dictionary Induction, EMNLP 2017 [PDF] Abstract
119. Learning Efficient Object Detection Models with Knowledge Distillation, NeurIPS 2017 [PDF] Abstract
120. Sequence-Level Knowledge Distillation, EMNLP 2016 [PDF] Abstract
121. FitNets: Hints for Thin Deep Nets, ICLR 2015 [PDF] Abstract

Abstracts

1. Towards Cross-Modality Medical Image Segmentation with Online Mutual Knowledge Distillation [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Applications
  Kang Li, Lequan Yu, Shujun Wang, Pheng-Ann Heng
The success of deep convolutional neural networks is partially attributed to the massive amount of annotated training data. However, in practice, medical data annotations are usually expensive and time-consuming to obtain. Considering that multi-modality data with the same anatomic structures are widely available in clinical routine, in this paper, we aim to exploit the prior knowledge (e.g., shape priors) learned from one modality (aka., assistant modality) to improve the segmentation performance on another modality (aka., target modality) to make up for the annotation scarcity. To alleviate the learning difficulties caused by modality-specific appearance discrepancy, we first present an Image Alignment Module (IAM) to narrow the appearance gap between assistant and target modality data. We then propose a novel Mutual Knowledge Distillation (MKD) scheme to thoroughly exploit the modality-shared knowledge to facilitate the target-modality segmentation. To be specific, we formulate our framework as an integration of two individual segmentors. Each segmentor not only explicitly extracts one modality's knowledge from the corresponding annotations, but also implicitly explores the other modality's knowledge from its counterpart in a mutual-guided manner. The ensemble of the two segmentors further integrates the knowledge from both modalities and generates reliable segmentation results on the target modality. Experimental results on the public multi-class cardiac segmentation data, i.e., MM-WHS 2017, show that our method achieves large improvements on CT segmentation by utilizing additional MRI data and outperforms other state-of-the-art multi-modality learning methods.
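
To make the mutual-guidance idea concrete, below is a minimal PyTorch sketch of a two-way distillation loss in the spirit of MKD: each segmentor keeps its own supervised term and additionally mimics its peer's softened, detached prediction. The tensor shapes, temperature T, and weight lam are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a two-way (mutual) distillation loss between two segmentors.
import torch
import torch.nn.functional as F

def mutual_kd_loss(logits_a, logits_b, labels_a, labels_b, T=2.0, lam=0.5):
    """logits_*: (N, C, H, W) pixel-wise class scores; labels_*: (N, H, W) class ids."""
    ce_a = F.cross_entropy(logits_a, labels_a)          # supervised term, segmentor A
    ce_b = F.cross_entropy(logits_b, labels_b)          # supervised term, segmentor B
    # each segmentor mimics the other's softened prediction (detach = treat the peer as a teacher)
    kd_a = F.kl_div(F.log_softmax(logits_a / T, dim=1),
                    F.softmax(logits_b.detach() / T, dim=1),
                    reduction="batchmean") * T * T
    kd_b = F.kl_div(F.log_softmax(logits_b / T, dim=1),
                    F.softmax(logits_a.detach() / T, dim=1),
                    reduction="batchmean") * T * T
    return ce_a + ce_b + lam * (kd_a + kd_b)
```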

2. DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Machine Learning
  Sravanti Addepalli, Gaurav Kumar Nayak, Anirban Chakraborty, Venkatesh Babu Radhakrishnan
In this era of digital information explosion, an abundance of data from numerous modalities is being generated as well as archived every day. However, most problems associated with training Deep Neural Networks still revolve around the lack of data that is rich enough for a given task. Data is required not only for training an initial model, but also for future learning tasks such as Model Compression and Incremental Learning. A diverse dataset may be used for training an initial model, but it may not be feasible to store it throughout the product life cycle due to data privacy issues or memory constraints. We propose to bridge the gap between the abundance of available data and the lack of relevant data, for the future learning tasks of a given trained network. We use the available data, which may be an imbalanced subset of the original training dataset or a related domain dataset, to retrieve representative samples from a trained classifier, using a novel Data-enriching GAN (DeGAN) framework. We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance for the tasks of Data-free Knowledge Distillation and Incremental Learning on benchmark datasets. We further demonstrate that our proposed framework can enrich any data, even from unrelated domains, to make it more useful for the future learning tasks of a given network.

3. Few Shot Network Compression via Cross Distillation [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Machine Learning
  Haoli Bai, Jiaxiang Wu, Irwin King, Michael R. Lyu
Model compression has been widely adopted to obtain light-weighted deep neural networks. Most prevalent methods, however, require fine-tuning with sufficient training data to ensure accuracy, which could be challenged by privacy and security issues. As a compromise between privacy and performance, in this paper we investigate few shot network compression: given few samples per class, how can we effectively compress the network with negligible performance drop? The core challenge of few shot network compression lies in the high estimation errors from the original network during inference, since the compressed network can easily over-fit on the few training instances. The estimation errors could propagate and accumulate layer by layer and finally deteriorate the network output. To address the problem, we propose cross distillation, a novel layer-wise knowledge distillation approach. By interweaving hidden layers of the teacher and student networks, layer-wise accumulated estimation errors can be effectively reduced. The proposed method offers a general framework compatible with prevalent network compression techniques such as pruning. Extensive experiments on benchmark datasets demonstrate that cross distillation can significantly improve the student network's accuracy when only a few training instances are available.

4. Scalable Attentive Sentence Pair Modeling via Distilled Sentence Embedding [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Machine Learning
  Oren Barkan, Noam Razin, Itzik Malkiel, Ori Katz, Avi Caciularu, Noam Koenigstein
Recent state-of-the-art natural language understanding models, such as BERT and XLNet, score a pair of sentences (A and B) using multiple cross-attention operations – a process in which each word in sentence A attends to all words in sentence B and vice versa. As a result, computing the similarity between a query sentence and a set of candidate sentences requires the propagation of all query-candidate sentence-pairs throughout a stack of cross-attention layers. This exhaustive process becomes computationally prohibitive when the number of candidate sentences is large. In contrast, sentence embedding techniques learn a sentence-to-vector mapping and compute the similarity between the sentence vectors via simple elementary operations. In this paper, we introduce Distilled Sentence Embedding (DSE) – a model that is based on knowledge distillation from cross-attentive models, focusing on sentence-pair tasks. The outline of DSE is as follows: Given a cross-attentive teacher model (e.g. a fine-tuned BERT), we train a sentence embedding based student model to reconstruct the sentence-pair scores obtained by the teacher model. We empirically demonstrate the effectiveness of DSE on five GLUE sentence-pair tasks. DSE significantly outperforms several ELMO variants and other sentence embedding methods, while accelerating computation of the query-candidate sentence-pair similarities by several orders of magnitude, with an average relative degradation of 4.6% compared to BERT. Furthermore, we show that DSE produces sentence embeddings that reach state-of-the-art performance on universal sentence representation benchmarks. Our code is made publicly available at https://github.com/microsoft/Distilled-Sentence-Embedding.
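
Below is a minimal sketch of the DSE recipe described above: a sentence-embedding student scores a pair from independently encoded vectors and regresses onto the frozen cross-attentive teacher's pair score. The encoder, the [u; v; |u-v|; u*v] combination, and the MSE objective are assumptions for illustration rather than the paper's exact design.

```python
# Hedged sketch: a sentence-embedding student trained to reconstruct a cross-attentive
# teacher's sentence-pair scores.
import torch
import torch.nn as nn

class PairScoreStudent(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int):
        super().__init__()
        self.encoder = encoder                  # maps a batch of sentences to (batch, dim) vectors
        self.scorer = nn.Linear(4 * dim, 1)     # scores the combined pair features

    def forward(self, sent_a, sent_b):
        u, v = self.encoder(sent_a), self.encoder(sent_b)
        feats = torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)
        return self.scorer(feats).squeeze(-1)   # one scalar per pair

def distill_step(student, teacher_scores, sent_a, sent_b):
    # regress the student's pair score onto the (fixed) teacher's score
    return nn.functional.mse_loss(student(sent_a, sent_b), teacher_scores)
```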

5. Online Knowledge Distillation with Diverse Peers [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Machine Learning
  Defang Chen, Jian-Ping Mei, Can Wang, Yan Feng, Chun Chen
Distillation is an effective knowledge-transfer technique that uses predicted distributions of a powerful teacher model as soft targets to train a less-parameterized student model. A pre-trained high capacity teacher, however, is not always available. Recently proposed online variants use the aggregated intermediate predictions of multiple student models as targets to train each student model. Although group-derived targets give a good recipe for teacher-free distillation, group members are homogenized quickly with simple aggregation functions, leading to early saturated solutions. In this work, we propose Online Knowledge Distillation with Diverse peers (OKDDip), which performs two-level distillation during training with multiple auxiliary peers and one group leader. In the first-level distillation, each auxiliary peer holds an individual set of aggregation weights generated with an attention-based mechanism to derive its own targets from predictions of other auxiliary peers. Learning from distinct target distributions helps to boost peer diversity for effectiveness of group-based distillation. The second-level distillation is performed to transfer the knowledge in the ensemble of auxiliary peers further to the group leader, i.e., the model used for inference. Experimental results show that the proposed framework consistently gives better performance than state-of-the-art approaches without sacrificing training or inference complexity, demonstrating the effectiveness of the proposed two-level distillation framework.
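
As a rough illustration of the first-level, attention-weighted peer aggregation, the sketch below derives a distinct soft target for each auxiliary peer as a learned combination of all peers' predictions. The projection layers, tensor shapes, and plain dot-product attention are assumptions; OKDDip's exact mechanism may differ.

```python
# Hedged sketch of attention-based aggregation of peer predictions into per-peer targets.
import torch
import torch.nn.functional as F

def peer_targets(peer_logits, peer_feats, query_proj, key_proj):
    """peer_logits: (P, B, C); peer_feats: (P, B, D); query_proj/key_proj: nn.Linear(D, D)."""
    q = query_proj(peer_feats)                       # (P, B, D)
    k = key_proj(peer_feats)                         # (P, B, D)
    att = torch.einsum("pbd,qbd->bpq", q, k)         # (B, P, P) pairwise attention scores
    att = F.softmax(att, dim=-1)                     # each peer's weights over all peers
    probs = F.softmax(peer_logits, dim=-1)           # (P, B, C) peer predictions
    # target for peer p = sum over peers q of att[b, p, q] * probs[q, b, :]
    return torch.einsum("bpq,qbc->pbc", att, probs)
```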

6. Distilling Portable Generative Adversarial Networks for Image Translation [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Machine Learning
  Hanting Chen, Yunhe Wang, Han Shu, Changyuan Wen, Chunjing Xu, Boxin Shi, Chao Xu, Chang Xu
Although Generative Adversarial Networks (GANs) have been widely used in various image-to-image translation tasks, they can hardly be applied on mobile devices due to their heavy computation and storage cost. Traditional network compression methods focus on visual recognition tasks and have never dealt with generation tasks. Inspired by knowledge distillation, a student generator with fewer parameters is trained by inheriting the low-level and high-level information from the original heavy teacher generator. To promote the capability of the student generator, we include a student discriminator to measure the distances between real images and the images generated by the student and teacher generators. An adversarial learning process is therefore established to optimize the student generator and student discriminator. Qualitative and quantitative analysis by conducting experiments on benchmark datasets demonstrates that the proposed method can learn portable generative models with strong performance.

7. Adversarially Robust Distillation [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Machine Learning
  Micah Goldblum, Liam Fowl, Soheil Feizi, Tom Goldstein
Knowledge distillation is effective for producing small, high-performance neural networks for classification, but these small networks are vulnerable to adversarial attacks. This paper studies how adversarial robustness transfers from teacher to student during knowledge distillation. We find that a large amount of robustness may be inherited by the student even when distilled on only clean images. Second, we introduce Adversarially Robust Distillation (ARD) for distilling robustness onto student networks. In addition to producing small models with high test accuracy like conventional distillation, ARD also passes the superior robustness of large networks onto the student. In our experiments, we find that ARD student models decisively outperform adversarially trained networks of identical architecture in terms of robust accuracy, surpassing state-of-the-art methods on standard robustness benchmarks. Finally, we adapt recent fast adversarial training methods to ARD for accelerated robust distillation.
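
A hedged sketch of an ARD-style objective follows: the student is attacked so that its output on the perturbed input drifts away from the teacher's output on the clean input, and training pushes the two back together. The simplified PGD attack, temperature, and loss weights are assumptions, not the paper's exact settings.

```python
# Hedged sketch of robust distillation: match the student on adversarial inputs to the
# teacher on clean inputs.
import torch
import torch.nn.functional as F

def pgd(model, x, target_probs, eps=8/255, alpha=2/255, steps=5, T=4.0):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.kl_div(F.log_softmax(model(x_adv) / T, dim=1), target_probs,
                        reduction="batchmean")
        grad = torch.autograd.grad(loss, x_adv)[0]
        # gradient-ascent step, then project back into the eps-ball and valid pixel range
        step = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(step, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()

def ard_loss(student, teacher, x, y, alpha=0.9, T=4.0):
    with torch.no_grad():
        t_probs = F.softmax(teacher(x) / T, dim=1)       # teacher only ever sees clean images
    x_adv = pgd(student, x, t_probs, T=T)                # attack the student
    kd = F.kl_div(F.log_softmax(student(x_adv) / T, dim=1), t_probs,
                  reduction="batchmean") * T * T
    ce = F.cross_entropy(student(x), y)                  # clean supervised term
    return alpha * kd + (1 - alpha) * ce
```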

8. Towards Oracle Knowledge Distillation with Neural Architecture Search [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Machine Learning
  Minsoo Kang, Jonghwan Mun, Bohyung Han
We present a novel framework of knowledge distillation that is capable of learning powerful and efficient student models from ensemble teacher networks. Our approach addresses the inherent model capacity issue between teacher and student and aims to maximize benefit from teacher models during distillation by reducing their capacity gap. Specifically, we employ a neural architecture search technique to augment useful structures and operations, where the searched network is appropriate for knowledge distillation towards student models and free from sacrificing its performance by fixing the network capacity. We also introduce an oracle knowledge distillation loss to facilitate model search and distillation using an ensemble-based teacher model, where a student network is learned to imitate oracle performance of the teacher. We perform extensive experiments on the image classification datasets—CIFAR-100 and TinyImageNet—using various networks. We also show that searching for a new student model is effective in both accuracy and memory size and that the searched models often outperform their teacher models thanks to neural architecture search with oracle knowledge distillation.

9. Improved Knowledge Distillation via Teacher Assistant [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Machine Learning
  Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, Hassan Ghasemzadeh
Despite the fact that deep neural networks are powerful models and achieve appealing results on many tasks, they are too large to be deployed on edge devices like smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, where a large (teacher) pre-trained network is used to train a smaller (student) network. However, in this paper, we show that the student network performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher, or in other words, a teacher can effectively transfer its knowledge to students up to a certain size, not smaller. To alleviate this shortcoming, we introduce multi-step knowledge distillation, which employs an intermediate-sized network (teacher assistant) to bridge the gap between the student and the teacher. Moreover, we study the effect of teacher assistant size and extend the framework to multi-step distillation. Theoretical analysis and extensive experiments on CIFAR-10,100 and ImageNet datasets and on CNN and ResNet architectures substantiate the effectiveness of our proposed approach.
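
The multi-step recipe can be pictured as running an ordinary softened-logit distillation twice, first teacher-to-assistant and then assistant-to-student. The sketch below uses the standard Hinton-style KD loss; the optimizer, temperature, and training driver are illustrative assumptions.

```python
# Hedged sketch of multi-step (teacher -> assistant -> student) distillation.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

def distill(teacher, student, loader, epochs=1, lr=1e-3):
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    teacher.eval()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                t_logits = teacher(x)
            loss = kd_loss(student(x), t_logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student

# Multi-step TAKD: first distill the large teacher into a mid-sized assistant,
# then distill the assistant into the small student.
# assistant = distill(teacher, assistant, loader)
# student   = distill(assistant, student, loader)
```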

10. Light Multi-Segment Activation for Model Compression [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Machine Learning
  Zhenhui Xu, Guolin Ke, Jia Zhang, Jiang Bian, Tie-Yan Liu
Model compression has become necessary when applying neural networks (NN) to many real application tasks that can accept slightly-reduced model accuracy but have strict tolerance to model complexity. Recently, Knowledge Distillation, which distills the knowledge from a well-trained and highly complex teacher model into a compact student model, has been widely used for model compression. However, under the strict requirement on the resource cost, it is quite challenging to make the student model achieve comparable performance with the teacher one, essentially due to the drastically-reduced expressiveness ability of the compact student model. Inspired by the nature of the expressiveness ability in NN, we propose to use multi-segment activation, which can significantly improve the expressiveness ability with very little cost, in the compact student model. Specifically, we propose a highly efficient multi-segment activation, called Light Multi-segment Activation (LMA), which can rapidly produce multiple linear regions with very few parameters by leveraging statistical information. Using LMA, the compact student model is capable of achieving much better performance effectively and efficiently than the ReLU-equipped one with the same model complexity. Furthermore, the proposed method is compatible with other model compression techniques, such as quantization, which means they can be used jointly for better compression performance. Experiments on state-of-the-art NN architectures over real-world tasks demonstrate the effectiveness and extensibility of LMA.
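
As one way to picture a multi-segment activation, the sketch below builds a continuous piecewise-linear function from a small ReLU basis with learnable slopes, giving several linear regions at the cost of only a handful of parameters. LMA's actual parameterization (e.g., how it uses statistical information to place the segments) is not reproduced here.

```python
# Hedged sketch of a generic multi-segment (piecewise-linear) activation.
import torch
import torch.nn as nn

class MultiSegmentActivation(nn.Module):
    """Continuous piecewise-linear activation: f(x) = bias + sum_k a_k * relu(x - c_k)."""

    def __init__(self, breakpoints=(-1.0, 0.0, 1.0)):
        super().__init__()
        self.register_buffer("c", torch.tensor(breakpoints))          # fixed segment boundaries
        self.a = nn.Parameter(torch.ones(len(breakpoints)) / len(breakpoints))  # learnable slopes
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # broadcast x against each breakpoint and weight the resulting ReLU pieces
        pieces = torch.relu(x.unsqueeze(-1) - self.c)                  # (..., K)
        return self.bias + (pieces * self.a).sum(dim=-1)
```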

11. Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Machine Learning
  Ya Zhao, Rui Xu, Xinchao Wang, Peng Hou, Haihong Tang, Mingli Song
Lip reading has witnessed unparalleled development in recent years thanks to deep learning and the availability of large-scale datasets. Despite the encouraging results achieved, the performance of lip reading, unfortunately, remains inferior to that of its counterpart, speech recognition, due to the ambiguous nature of its actuations that makes it challenging to extract discriminant features from lip movement videos. In this paper, we propose a new method, termed Lip by Speech (LIBS), whose goal is to strengthen lip reading by learning from speech recognizers. The rationale behind our approach is that the features extracted from speech recognizers may provide complementary and discriminant clues, which are difficult to obtain from the subtle movements of the lips, and consequently facilitate the training of lip readers. This is achieved, specifically, by distilling multi-granularity knowledge from speech recognizers to lip readers. To conduct this cross-modal knowledge distillation, we utilize an efficacious alignment scheme to handle the inconsistent lengths of the audios and videos, as well as an innovative filtering strategy to refine the speech recognizer's prediction. The proposed method achieves new state-of-the-art performance on the CMLR and LRS2 datasets, outperforming the baseline by a margin of 7.66% and 2.75% in character error rate, respectively.

12. Knowledge Distillation from Internal Representations [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Natural Language Processing
  Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, Chenlei Guo
Knowledge distillation is typically conducted by training a small model (the student) to mimic a large and cumbersome model (the teacher). The idea is to compress the knowledge from the teacher by using its output probabilities as soft-labels to optimize the student. However, when the teacher is considerably large, there is no guarantee that the internal knowledge of the teacher will be transferred into the student; even if the student closely matches the soft-labels, its internal representations may be considerably different. This internal mismatch can undermine the generalization capabilities originally intended to be transferred from the teacher to the student. In this paper, we propose to distill the internal representations of a large model such as BERT into a simplified version of it. We formulate two ways to distill such representations and various algorithms to conduct the distillation. We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation.
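
A minimal sketch of combining soft-label distillation with an internal-representation term is shown below: selected student layers are projected and pulled toward the corresponding teacher layers while the usual softened-logit loss is kept. The layer mapping, projection, and cosine objective are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch: soft-label distillation plus an internal-representation matching term.
import torch
import torch.nn.functional as F

def internal_kd_loss(s_hidden, t_hidden, s_logits, t_logits, proj, layer_map,
                     T=2.0, beta=1.0):
    """s_hidden/t_hidden: lists of (batch, seq, dim) hidden states;
    proj: nn.Linear mapping student dim to teacher dim; layer_map: student idx -> teacher idx."""
    soft = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    rep = 0.0
    for s_idx, t_idx in layer_map.items():
        s = proj(s_hidden[s_idx])                 # project student states to the teacher's width
        rep = rep + (1 - F.cosine_similarity(s, t_hidden[t_idx], dim=-1)).mean()
    return soft + beta * rep
```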

13. Go From the General to the Particular: Multi-Domain Translation with Domain Transformation Networks [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Natural Language Processing
  Yong Wang, Longyue Wang, Shuming Shi, Victor O. K. Li, Zhaopeng Tu
The key challenge of multi-domain translation lies in simultaneously encoding both the general knowledge shared across domains and the particular knowledge distinctive to each domain in a unified model. Previous work shows that the standard neural machine translation (NMT) model, trained on mixed-domain data, generally captures the general knowledge, but misses the domain-specific knowledge. In response to this problem, we augment the NMT model with additional domain transformation networks to transform the general representations to domain-specific representations, which are subsequently fed to the NMT decoder. To guarantee the knowledge transformation, we also propose two complementary supervision signals by leveraging the power of knowledge distillation and adversarial learning. Experimental results on several language pairs, covering both balanced and unbalanced multi-domain translation, demonstrate the effectiveness and universality of the proposed approach. Encouragingly, the proposed unified model achieves comparable results with the fine-tuning approach that requires multiple models to preserve the particular knowledge. Further analyses reveal that the domain transformation networks successfully capture the domain-specific knowledge as expected.

14. Acquiring Knowledge from Pre-Trained Model to Neural Machine Translation [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Natural Language Processing
  Rongxiang Weng, Heng Yu, Shujian Huang, Shanbo Cheng, Weihua Luo
Pre-training and fine-tuning have achieved great success in the natural language processing field. The standard paradigm of exploiting them includes two steps: first, pre-training a model, e.g. BERT, with large-scale unlabeled monolingual data; then, fine-tuning the pre-trained model with labeled data from downstream tasks. However, in neural machine translation (NMT), the training objective of the bilingual task is far different from that of the monolingual pre-trained model. This gap means that fine-tuning alone cannot fully utilize the prior language knowledge in NMT. In this paper, we propose an Apt framework for acquiring knowledge from pre-trained models for NMT. The proposed approach includes two modules: 1) a dynamic fusion mechanism to fuse task-specific features adapted from general knowledge into the NMT network, and 2) a knowledge distillation paradigm to learn language knowledge continuously during the NMT training process. The proposed approach can integrate suitable knowledge from pre-trained models to improve NMT. Experimental results on WMT English-to-German, German-to-English and Chinese-to-English machine translation tasks show that our model outperforms strong baselines and the fine-tuning counterparts.

15. Distilling Knowledge from Well-Informed Soft Labels for Neural Relation Extraction [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Natural Language Processing
  Zhenyu Zhang, Xiaobo Shu, Bowen Yu, Tingwen Liu, Jiapeng Zhao, Quangang Li, Li Guo
Extracting relations from plain text is an important task with wide applications. Most existing methods formulate it as a supervised problem and utilize one-hot hard labels as the sole target in training, neglecting the rich semantic information among relations. In this paper, we aim to explore supervision with soft labels in relation extraction, which makes it possible to integrate prior knowledge. Specifically, a bipartite graph is first devised to discover type constraints between entities and relations based on the entire corpus. Then, we combine such type constraints with neural networks to achieve a knowledgeable model. Furthermore, this model is regarded as a teacher to generate well-informed soft labels and guide the optimization of a student network via knowledge distillation. Besides, a multi-aspect attention mechanism is introduced to help the student mine latent information from text. In this way, the enhanced student inherits the dark knowledge (e.g., type constraints and relevance among relations) from the teacher, and directly serves the testing scenarios without any extra constraints. We conduct extensive experiments on the TACRED and SemEval datasets, and the experimental results justify the effectiveness of our approach.

16. Ultrafast Video Attention Prediction with Coupled Knowledge Distillation [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Vision
  Kui Fu, Peipei Shi, Yafei Song, Shiming Ge, Xiangju Lu, Jia Li
Large convolutional neural network models have recently demonstrated impressive performance on video attention prediction. Conventionally, these models require intensive computation and large memory. To address these issues, we design an extremely light-weight network with ultrafast speed, named UVA-Net. The network is constructed based on depth-wise convolutions and takes low-resolution images as input. However, this straightforward acceleration method will decrease performance dramatically. To this end, we propose a coupled knowledge distillation strategy to augment and train the network effectively. With this strategy, the model can further automatically discover and emphasize implicit useful cues contained in the data. Both the spatial and temporal knowledge learned by the high-resolution complex teacher networks can also be distilled and transferred into the proposed low-resolution light-weight spatiotemporal network. Experimental results show that the performance of our model is comparable to that of 11 state-of-the-art models in video attention prediction, while it costs only a 0.68 MB memory footprint and runs at about 10,106 FPS on GPU and 404 FPS on CPU, which is 206 times faster than previous models.

17. Look One and More: Distilling Hybrid Order Relational Knowledge for Cross-Resolution Image Recognition [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Vision
  Shiming Ge, Kangkai Zhang, Haolin Liu, Yingying Hua, Shengwei Zhao, Xin Jin, Hao Wen
In spite of the great success achieved by recent deep models in many image recognition tasks, directly applying them to recognize low-resolution images may suffer from low accuracy due to the loss of informative details during resolution degradation. However, these images are still recognizable to subjects who are familiar with the corresponding high-resolution ones. Inspired by that, we propose a teacher-student learning approach to facilitate low-resolution image recognition via hybrid order relational knowledge distillation. The approach comprises three streams: the teacher stream is pretrained to recognize high-resolution images with high accuracy, the student stream is learned to identify low-resolution images by mimicking the teacher's behaviors, and an extra assistant stream is introduced as a bridge to help knowledge transfer from the teacher to the student. To extract sufficient knowledge for reducing the loss in accuracy, the learning of the student is supervised with multiple losses, which preserve the similarities in various order relational structures. In this way, the capability of recovering missing details of familiar low-resolution images can be effectively enhanced, leading to better knowledge transfer. Extensive experiments on metric learning, low-resolution image classification and low-resolution face recognition tasks show the effectiveness of our approach, while using much smaller models.

18. Uncertainty-Aware Multi-Shot Knowledge Distillation for Image-Based Object Re-Identification [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Vision
  Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen
Object re-identification (re-id) aims to identify a specific object across times or camera views, with the person re-id and vehicle re-id as the most widely studied applications. Re-id is challenging because of the variations in viewpoints, (human) poses, and occlusions. Multi-shots of the same object can cover diverse viewpoints/poses and thus provide more comprehensive information. In this paper, we propose exploiting the multi-shots of the same identity to guide the feature learning of each individual image. Specifically, we design an Uncertainty-aware Multi-shot Teacher-Student (UMTS) Network. It consists of a teacher network (T-net) that learns the comprehensive features from multiple images of the same object, and a student network (S-net) that takes a single image as input. In particular, we take into account the data dependent heteroscedastic uncertainty for effectively transferring the knowledge from the T-net to S-net. To the best of our knowledge, we are the first to make use of multi-shots of an object in a teacher-student learning manner for effectively boosting the single image based re-id. We validate the effectiveness of our approach on the popular vehicle re-id and person re-id datasets. In inference, the S-net alone significantly outperforms the baselines and achieves the state-of-the-art performance.

19. Hierarchical Knowledge Squeezed Adversarial Network Compression [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Vision
  Peng Li, Chang Shu, Yuan Xie, Yan Qu, Hui Kong
Deep network compression has achieved notable progress via knowledge distillation, where a teacher-student learning manner is adopted with a predetermined loss. Recently, more attention has been devoted to employing adversarial training to minimize the discrepancy between the output distributions of the two networks. However, such approaches emphasize result-oriented learning while neglecting process-oriented learning, leading to the loss of the rich information contained in the whole network pipeline. In other (non-GAN-based) process-oriented methods, on the other hand, the knowledge has usually been transferred in a redundant manner. Observing that the small network cannot perfectly mimic a large one due to the huge gap in network scale, we propose a knowledge transfer method, involving effective intermediate supervision, under the adversarial training framework to learn the student network. Different from other intermediate supervision methods, we design the knowledge representation in a compact form by introducing a task-driven attention mechanism. Meanwhile, to improve the representation capability of the attention-based method, a hierarchical structure is utilized so that powerful but highly squeezed knowledge is realized and the knowledge from the teacher network can accommodate the size of the student network. Extensive experimental results on three typical benchmark datasets, i.e., CIFAR-10, CIFAR-100, and ImageNet, demonstrate that our method achieves highly superior performance against state-of-the-art methods.

20. Exploit and Replace: An Asymmetrical Two-Stream Architecture for Versatile Light Field Saliency Detection [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Vision
  Yongri Piao, Zhengkun Rong, Miao Zhang, Huchuan Lu
Light field saliency detection has attracted increasing interest in recent years due to the significant improvements achieved in challenging scenes by using abundant light field cues. However, the high dimensionality of light field data poses computation-intensive and memory-intensive challenges, and light field data access is far less ubiquitous than that of RGB data. These issues may severely impede practical applications of light field saliency detection. In this paper, we introduce an asymmetrical two-stream architecture inspired by knowledge distillation to confront these challenges. First, we design a teacher network to learn to exploit focal slices for higher requirements on desktop computers and meanwhile transfer comprehensive focusness knowledge to the student network. Our teacher network is built on two tailor-made modules, namely a multi-focusness recruiting module (MFRM) and a multi-focusness screening module (MFSM). Second, we propose two distillation schemes to train a student network towards memory and computation efficiency while ensuring performance. The proposed distillation schemes ensure better absorption of focusness knowledge and enable the student to replace the focal slices with a single RGB image in a user-friendly way. We conduct experiments on three benchmark datasets and demonstrate that our teacher network achieves state-of-the-art performance and the student network (ResNet18) achieves Top-1 accuracy on the HFUT-LFSD dataset and Top-4 on DUT-LFSD, while reducing the model size by 56% and boosting the frames per second (FPS) by 159%, compared with the best performing method.

21. An Efficient Framework for Dense Video Captioning [PDF] Back to Contents
  AAAI 2020. AAAI Technical Track: Vision
  Maitreya Suin, A. N. Rajagopalan
Dense video captioning is an extremely challenging task since an accurate and faithful description of events in a video requires a holistic knowledge of the video contents as well as contextual reasoning about individual events. Most existing approaches handle this problem by first proposing event boundaries from a video and then captioning a subset of the proposals. Generating dense temporal annotations and corresponding captions from long videos can be dramatically resource consuming. In this paper, we focus on the task of generating a dense description of temporally untrimmed videos and aim to significantly reduce the computational cost by processing fewer frames while maintaining accuracy. Existing video captioning methods sample frames with a predefined frequency over the entire video or use all the frames. Instead, we propose a deep reinforcement-based approach which enables an agent to describe multiple events in a video by watching a portion of the frames. The agent needs to watch more frames when it is processing an informative part of the video, and skip frames when there is redundancy. The agent is trained using an actor-critic algorithm, where the actor determines the frames to be watched from a video and the critic assesses the optimality of the decisions taken by the actor. Such an efficient frame selection simplifies the event proposal task considerably. This has the added effect of reducing the occurrence of unwanted proposals. The encoded state representation of the frame selection agent is further utilized for guiding the event proposal and caption generation tasks. We also leverage the idea of knowledge distillation to improve accuracy. We conduct extensive evaluations on the ActivityNet Captions dataset to validate our method.

22. A Study of Non-autoregressive Model for Sequence Generation [PDF] Back to Contents
  ACL 2020.
  Yi Ren, Jinglin Liu, Xu Tan, Zhou Zhao, Sheng Zhao, Tie-Yan Liu
Non-autoregressive (NAR) models generate all the tokens of a sequence in parallel, resulting in faster generation speed compared to their autoregressive (AR) counterparts but at the cost of lower accuracy. Different techniques including knowledge distillation and source-target alignment have been proposed to bridge the gap between AR and NAR models in various tasks such as neural machine translation (NMT), automatic speech recognition (ASR), and text to speech (TTS). With the help of those techniques, NAR models can catch up with the accuracy of AR models in some tasks but not in others. In this work, we conduct a study to understand the difficulty of NAR sequence generation and try to answer: (1) Why can NAR models catch up with AR models in some tasks but not all? (2) Why can techniques like knowledge distillation and source-target alignment help NAR models? Since the main difference between AR and NAR models is that NAR models do not use dependency among target tokens while AR models do, intuitively the difficulty of NAR sequence generation heavily depends on the strength of the dependency among target tokens. To quantify such dependency, we propose an analysis model called CoMMA to characterize the difficulty of different NAR sequence generation tasks. We have several interesting findings: 1) Among the NMT, ASR and TTS tasks, ASR has the most target-token dependency while TTS has the least. 2) Knowledge distillation reduces the target-token dependency in the target sequence and thus improves the accuracy of NAR models. 3) The source-target alignment constraint encourages dependency of a target token on source tokens and thus eases the training of NAR models.

23. Improving Non-autoregressive Neural Machine Translation with Monolingual Data [PDF] Back to Contents
  ACL 2020.
  Jiawei Zhou, Phillip Keung
Non-autoregressive (NAR) neural machine translation is usually done via knowledge distillation from an autoregressive (AR) model. Under this framework, we leverage large monolingual corpora to improve the NAR model’s performance, with the goal of transferring the AR model’s generalization ability while preventing overfitting. On top of a strong NAR baseline, our experimental results on the WMT14 En-De and WMT16 En-Ro news translation tasks confirm that monolingual data augmentation consistently improves the performance of the NAR model to approach the teacher AR model’s performance, yields comparable or better results than the best non-iterative NAR methods in the literature and helps reduce overfitting in the training process.
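
For context, the sequence-level distillation step that such NAR training starts from can be sketched as follows: the AR teacher beam-decodes the source side, and its outputs replace the references as the student's training targets. The Hugging Face teacher checkpoint and decoding settings below are assumptions, not those used in the paper.

```python
# Hedged sketch of sequence-level distillation data generation with an AR teacher.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

teacher_name = "Helsinki-NLP/opus-mt-en-de"      # any trained AR teacher would do
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_name)

def distill_targets(src_sentences, num_beams=5, max_length=256):
    batch = tok(src_sentences, return_tensors="pt", padding=True, truncation=True)
    out = teacher.generate(**batch, num_beams=num_beams, max_length=max_length)
    return tok.batch_decode(out, skip_special_tokens=True)

# The NAR student is then trained on (source, distill_targets(source)) pairs;
# monolingual source-side text can be pushed through the same pipeline for augmentation.
```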

24. XtremeDistil: Multi-stage Distillation for Massive Multilingual Models [PDF] Back to Contents
  ACL 2020.
  Subhabrata Mukherjee, Ahmed Hassan Awadallah
Deep and large pre-trained language models are the state-of-the-art for various natural language processing tasks. However, the huge size of these models could be a deterrent to using them in practice. Some recent works use knowledge distillation to compress these huge models into shallow ones. In this work we study knowledge distillation with a focus on multilingual Named Entity Recognition (NER). In particular, we study several distillation strategies and propose a stage-wise optimization scheme leveraging teacher internal representations, which is agnostic to the teacher architecture, and show that it outperforms strategies employed in prior works. Additionally, we investigate the role of several factors like the amount of unlabeled data, annotation resources, model architecture and inference latency, to name a few. We show that our approach leads to massive compression of teacher models like mBERT by up to 35x in terms of parameters and 51x in terms of latency for batch inference while retaining 95% of its F1-score for NER over 41 languages.

25. Structure-Level Knowledge Distillation For Multilingual Sequence Labeling [PDF] Back to Contents
  ACL 2020.
  Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Fei Huang, Kewei Tu
Multilingual sequence labeling is the task of predicting label sequences for multiple languages using a single unified model. Compared with relying on multiple monolingual models, using a multilingual model has the benefits of a smaller model size, easier online serving, and generalizability to low-resource languages. However, current multilingual models still significantly underperform individual monolingual models due to model capacity limitations. In this paper, we propose to reduce the gap between monolingual models and the unified multilingual model by distilling the structural knowledge of several monolingual models (teachers) into the unified multilingual model (student). We propose two novel KD methods based on structure-level information: (1) approximately minimizing the distance between the student’s and the teachers’ structure-level probability distributions, and (2) aggregating the structure-level knowledge into local distributions and minimizing the distance between the two local probability distributions. Our experiments on 4 multilingual tasks with 25 datasets show that our approaches outperform several strong baselines and have stronger zero-shot generalizability than both the baseline model and the teacher models.
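
As a simplified, token-level counterpart of the structure-level idea above, the sketch below matches the student's per-token label distributions to those of the corresponding monolingual teacher; the paper's structure-level variants operate on sequence/CRF distributions, which are not reproduced here. Shapes and the temperature are assumptions.

```python
# Hedged sketch of token-level distillation from a monolingual teacher into a
# multilingual sequence-labeling student.
import torch.nn.functional as F

def token_kd_loss(student_logits, teacher_logits, mask, T=1.0):
    """logits: (batch, seq, num_labels); mask: (batch, seq), 1.0 for real tokens, 0.0 for padding."""
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="none").sum(-1)          # per-token KL
    return (kl * mask).sum() / mask.sum() * T * T    # average over non-padding tokens
```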

26. Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [PDF] Back to Contents
  ACL 2020.
  Haipeng Sun, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Tiejun Zhao
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs. However, it can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time. That is, research on multilingual UNMT has been limited. In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder, making use of multilingual data to improve UNMT for all language pairs. On the basis of the empirical findings, we propose two knowledge distillation methods to further enhance multilingual UNMT performance. Our experiments on a dataset with English translated to and from twelve other languages (including three language families and six language branches) show remarkable results, surpassing strong unsupervised individual baselines while achieving promising performance between non-English language pairs in zero-shot translation scenarios and alleviating poor performance in low-resource language pairs.

27. SimulSpeech: End-to-End Simultaneous Speech to Text Translation [PDF] Back to Contents
  ACL 2020.
  Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, Tie-Yan Liu
In this work, we develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in source language to text in target language concurrently. SimulSpeech consists of a speech encoder, a speech segmenter and a text decoder, where 1) the segmenter builds upon the encoder and leverages a connectionist temporal classification (CTC) loss to split the input streaming speech in real time, 2) the encoder-decoder attention adopts a wait-k strategy for simultaneous translation. SimulSpeech is more challenging than previous cascaded systems (with simultaneous automatic speech recognition (ASR) and simultaneous neural machine translation (NMT)). We introduce two novel knowledge distillation methods to ensure the performance: 1) Attention-level knowledge distillation transfers the knowledge from the multiplication of the attention matrices of simultaneous NMT and ASR models to help the training of the attention mechanism in SimulSpeech; 2) Data-level knowledge distillation transfers the knowledge from the full-sentence NMT model and also reduces the complexity of data distribution to help on the optimization of SimulSpeech. Experiments on MuST-C English-Spanish and English-German spoken language translation datasets show that SimulSpeech achieves reasonable BLEU scores and lower delay compared to full-sentence end-to-end speech to text translation (without simultaneous translation), and better performance than the two-stage cascaded simultaneous translation model in terms of BLEU scores and translation delay.

28. Improving Event Detection via Open-domain Trigger Knowledge [PDF] Back to Contents
  ACL 2020.
  Meihan Tong, Bin Xu, Shuai Wang, Yixin Cao, Lei Hou, Juanzi Li, Jun Xie
Event Detection (ED) is a fundamental task in automatically structuring texts. Due to the small scale of training data, previous methods perform poorly on unseen/sparsely labeled trigger words and are prone to overfitting on densely labeled trigger words. To address the issue, we propose a novel Enrichment Knowledge Distillation (EKD) model to leverage external open-domain trigger knowledge to reduce the in-built biases toward frequent trigger words in annotations. Experiments on the benchmark ACE2005 show that our model outperforms nine strong baselines and is especially effective for unseen/sparsely labeled trigger words. The source code is released at https://github.com/shuaiwa16/ekd.git.

29. TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural Language Processing [PDF] Back to Contents
  ACL 2020. System Demonstrations
  Ziqing Yang, Yiming Cui, Zhipeng Chen, Wanxiang Che, Ting Liu, Shijin Wang, Guoping Hu
In this paper, we introduce TextBrewer, an open-source knowledge distillation toolkit designed for natural language processing. It works with different neural network models and supports various kinds of supervised learning tasks, such as text classification, reading comprehension, sequence labeling. TextBrewer provides a simple and uniform workflow that enables quick setting up of distillation experiments with highly flexible configurations. It offers a set of predefined distillation methods and can be extended with custom code. As a case study, we use TextBrewer to distill BERT on several typical NLP tasks. With simple configurations, we achieve results that are comparable with or even higher than the public distilled BERT models with similar numbers of parameters.

30. The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding [PDF] Back to Contents
  ACL 2020. System Demonstrations
  Xiaodong Liu, Yu Wang, Jianshu Ji, Hao Cheng, Xueyun Zhu, Emmanuel Awa, Pengcheng He, Weizhu Chen, Hoifung Poon, Guihong Cao, Jianfeng Gao
We present MT-DNN, an open-source natural language understanding (NLU) toolkit that makes it easy for researchers and developers to train customized deep learning models. Built upon PyTorch and Transformers, MT-DNN is designed to facilitate rapid customization for a broad spectrum of NLU tasks, using a variety of objectives (classification, regression, structured prediction) and text encoders (e.g., RNNs, BERT, RoBERTa, UniLM). A unique feature of MT-DNN is its built-in support for robust and transferable learning using the adversarial multi-task learning paradigm. To enable efficient production deployment, MT-DNN supports multi-task knowledge distillation, which can substantially compress a deep neural model without significant performance drop. We demonstrate the effectiveness of MT-DNN on a wide range of NLU applications across general and biomedical domains. The software and pre-trained models will be publicly available at https://github.com/namisan/mt-dnn.

31. Improving Autoregressive NMT with Non-Autoregressive Model [PDF] Back to Contents
  ACL 2020. the First Workshop on Automatic Simultaneous Translation
  Long Zhou, Jiajun Zhang, Chengqing Zong
Autoregressive neural machine translation (NMT) models are often used to teach non-autoregressive models via knowledge distillation. However, there are few studies on improving the quality of autoregressive translation (AT) using non-autoregressive translation (NAT). In this work, we propose a novel Encoder-NAD-AD framework for NMT, aiming at boosting AT with global information produced by NAT model. Specifically, under the semantic guidance of source-side context captured by the encoder, the non-autoregressive decoder (NAD) first learns to generate target-side hidden state sequence in parallel. Then the autoregressive decoder (AD) performs translation from left to right, conditioned on source-side and target-side hidden states. Since AD has global information generated by low-latency NAD, it is more likely to produce a better translation with less time delay. Experiments on WMT14 En-De, WMT16 En-Ro, and IWSLT14 De-En translation tasks demonstrate that our framework achieves significant improvements with only 8% speed degeneration over the autoregressive NMT.

32. End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020 [PDF] 返回目录
  ACL 2020. the 17th International Conference on Spoken Language Translation
  Marco Gaido, Mattia A. Di Gangi, Matteo Negri, Marco Turchi
This paper describes FBK’s participation in the IWSLT 2020 offline speech translation (ST) task. The task evaluates systems’ ability to translate English TED talks audio into German texts. The test talks are provided in two versions: one contains the data already segmented with automatic tools and the other is the raw data without any segmentation. Participants can decide whether to work on custom segmentation or not. We used the provided segmentation. Our system is an end-to-end model based on an adaptation of the Transformer for speech data. Its training process is the main focus of this paper and it is based on: i) transfer learning (ASR pretraining and knowledge distillation), ii) data augmentation (SpecAugment, time stretch and synthetic data), iii) combining synthetic and real data marked as different domains, and iv) multi-task learning using the CTC loss. Finally, after the training with word-level knowledge distillation is complete, our ST models are fine-tuned using label smoothed cross entropy. Our best model scored 29 BLEU on the MuST-C En-De test set, which is an excellent result compared to recent papers, and 23.7 BLEU on the same data segmented with VAD, showing the need for researching solutions addressing this specific data condition.

33. CASIA’s System for IWSLT 2020 Open Domain Translation [PDF] 返回目录
  ACL 2020. the 17th International Conference on Spoken Language Translation
  Qian Wang, Yuchen Liu, Cong Ma, Yu Lu, Yining Wang, Long Zhou, Yang Zhao, Jiajun Zhang, Chengqing Zong
This paper describes CASIA’s system for the IWSLT 2020 open domain translation task. This year we participate in both Chinese→Japanese and Japanese→Chinese translation tasks. Our system is a neural machine translation system based on the Transformer model. We augment the training data with knowledge distillation and back translation to improve the translation performance. Domain data classification and weighted domain model ensemble are introduced to generate the final translation result. We compare and analyze the performance on development data with different model settings and different data processing techniques.

34. Xiaomi’s Submissions for IWSLT 2020 Open Domain Translation Task [PDF] 返回目录
  ACL 2020. the 17th International Conference on Spoken Language Translation
  Yuhui Sun, Mengxue Guo, Xiang Li, Jianwei Cui, Bin Wang
This paper describes Xiaomi’s submissions to the IWSLT20 shared open domain translation task for the Chinese<->Japanese language pair. We explore different model ensembling strategies based on recent Transformer variants. We further strengthen our systems with several effective techniques, such as data filtering, data selection, tagged back translation, domain adaptation, knowledge distillation, and re-ranking. Our resulting Chinese->Japanese primary system ranked second in terms of character-level BLEU score among all submissions. Our resulting Japanese->Chinese primary system also achieved competitive performance.

35. Balancing Cost and Benefit with Tied-Multi Transformers [PDF] 返回目录
  ACL 2020. the Fourth Workshop on Neural Generation and Translation
  Raj Dabre, Raphael Rubino, Atsushi Fujita
We propose a novel procedure for training multiple Transformers with tied parameters, which compresses multiple models into one and enables a dynamic choice of the number of encoder and decoder layers during decoding. When training an encoder-decoder model, typically, the output of the last layer of the N-layer encoder is fed to the M-layer decoder, and the output of the last decoder layer is used to compute the loss. Instead, our method computes a single loss consisting of NxM losses, where each loss is computed from the output of one of the M decoder layers connected to one of the N encoder layers. Such a model subsumes NxM models with different numbers of encoder and decoder layers, and can be used for decoding with fewer than the maximum number of encoder and decoder layers. Given our flexible tied model, we also address the a priori selection of the number of encoder and decoder layers for faster decoding, and explore recurrent stacking of layers and knowledge distillation for model compression. We present a cost-benefit analysis of applying the proposed approaches to neural machine translation and show that they reduce decoding costs while preserving translation quality.
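A rough way to picture the NxM objective is one cross-entropy term for every (encoder depth, decoder depth) pair sharing the same output projection. The sketch below is a simplification under assumed module lists and an averaged loss, not the paper's exact training recipe.

```python
# Sketch of the N x M tied-multi training loss: one cross-entropy term for the output
# of every decoder depth m stacked on top of every encoder depth n.
# Module lists, dimensions, and the mean reduction are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def tied_multi_loss(enc_layers, dec_layers, out_proj, src, tgt_in, tgt_out):
    losses = []
    enc_state = src
    for enc in enc_layers:                      # encoder depth n = 1..N
        enc_state = enc(enc_state)
        dec_state = tgt_in
        for dec in dec_layers:                  # decoder depth m = 1..M
            dec_state = dec(dec_state, enc_state)
            logits = out_proj(dec_state)        # shared output projection
            losses.append(F.cross_entropy(
                logits.view(-1, logits.size(-1)), tgt_out.view(-1)))
    return torch.stack(losses).mean()           # single loss over all N*M pairs

# Example wiring with standard PyTorch layers (dimensions are arbitrary).
enc_layers = nn.ModuleList([nn.TransformerEncoderLayer(64, 4, batch_first=True) for _ in range(3)])
dec_layers = nn.ModuleList([nn.TransformerDecoderLayer(64, 4, batch_first=True) for _ in range(3)])
out_proj = nn.Linear(64, 1000)
src, tgt_in = torch.randn(2, 7, 64), torch.randn(2, 5, 64)
tgt_out = torch.randint(0, 1000, (2, 5))
loss = tied_multi_loss(enc_layers, dec_layers, out_proj, src, tgt_in, tgt_out)
```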

36. Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation [PDF] 返回目录
  ACL 2020. the Fourth Workshop on Neural Generation and Translation
  Mitchell Gordon, Kevin Duh
We explore best practices for training small, memory efficient machine translation models with sequence-level knowledge distillation in the domain adaptation setting. While both domain adaptation and knowledge distillation are widely-used, their interaction remains little understood. Our large-scale empirical results in machine translation (on three language pairs with three domains each) suggest distilling twice for best performance: once using general-domain data and again using in-domain data with an adapted teacher.

37. The NiuTrans System for WNGT 2020 Efficiency Task [PDF] 返回目录
  ACL 2020. the Fourth Workshop on Neural Generation and Translation
  Chi Hu, Bei Li, Yinqiao Li, Ye Lin, Yanyang Li, Chenglong Wang, Tong Xiao, Jingbo Zhu
This paper describes the submissions of the NiuTrans Team to the WNGT 2020 Efficiency Shared Task. We focus on the efficient implementation of deep Transformer models (Wang et al., 2019; Li et al., 2019) using NiuTensor, a flexible toolkit for NLP tasks. We explored the combination of deep encoder and shallow decoder in Transformer models via model compression and knowledge distillation. The neural machine translation decoding also benefits from FP16 inference, attention caching, dynamic batching, and batch pruning. Our systems achieve promising results in both translation quality and efficiency, e.g., our fastest system can translate more than 40,000 tokens per second with an RTX 2080 Ti while maintaining 42.9 BLEU on newstest2018.

38. Exploring the Limits of Simple Learners in Knowledge Distillation for Document Classification with DocBERT [PDF] 返回目录
  ACL 2020. the 5th Workshop on Representation Learning for NLP
  Ashutosh Adhikari, Achyudh Ram, Raphael Tang, William L. Hamilton, Jimmy Lin
Fine-tuned variants of BERT are able to achieve state-of-the-art accuracy on many natural language processing tasks, although at significant computational cost. In this paper, we verify BERT’s effectiveness for document classification and investigate the extent to which BERT-level effectiveness can be obtained by different baselines, combined with knowledge distillation—a popular model compression method. The results show that BERT-level effectiveness can be achieved by a single-layer LSTM with at least 40× fewer FLOPs and only ∼3% of the parameters. More importantly, this study analyzes the limits of knowledge distillation as we distill BERT’s knowledge all the way down to linear models—a relevant baseline for the task. We report substantial improvements in effectiveness for even the simplest models, as they capture the knowledge learnt by BERT.

39. Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation [PDF] 返回目录
  EMNLP 2020. Long Paper
  Nils Reimers, Iryna Gurevych
We present an easy and efficient method to extend existing sentence embedding models to new languages. This makes it possible to create multilingual versions of previously monolingual models. The training is based on the idea that a translated sentence should be mapped to the same location in the vector space as the original sentence. We use the original (monolingual) model to generate sentence embeddings for the source language and then train a new system on translated sentences to mimic the original model. Compared to other methods for training multilingual sentence embeddings, this approach has several advantages: it is easy to extend existing models to new languages with relatively few samples, it is easier to ensure desired properties of the vector space, and the hardware requirements for training are lower. We demonstrate the effectiveness of our approach for 50+ languages from various language families. Code to extend sentence embedding models to more than 400 languages is publicly available.
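The training signal described here reduces to two mean-squared-error terms: the student should map both the source sentence and its translation to the frozen teacher's embedding of the source sentence. A minimal sketch follows; the encoder modules and stand-in inputs are hypothetical.

```python
# Minimal sketch of multilingual distillation for sentence embeddings: the student is
# pulled toward the frozen monolingual teacher's embedding of the source sentence,
# for both the source sentence and its translation. Encoders here are stand-ins.
import torch
import torch.nn.functional as F

def multilingual_distill_loss(teacher, student, src_batch, trg_batch):
    with torch.no_grad():
        target_vec = teacher(src_batch)                        # e.g. English embeddings
    loss_src = F.mse_loss(student(src_batch), target_vec)      # same-language mimicry
    loss_trg = F.mse_loss(student(trg_batch), target_vec)      # translations map to the same point
    return loss_src + loss_trg

# Shape-checking the sketch with toy linear "encoders" (hypothetical stand-ins).
teacher = torch.nn.Linear(16, 8)
student = torch.nn.Linear(16, 8)
src_batch, trg_batch = torch.randn(4, 16), torch.randn(4, 16)
loss = multilingual_distill_loss(teacher, student, src_batch, trg_batch)
```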

40. Adversarial Self-Supervised Data Free Distillation for Text Classification [PDF] 返回目录
  EMNLP 2020. Long Paper
  Xinyin Ma, Yongliang Shen, Gongfan Fang, Chen Chen, Chenghao Jia, Weiming Lu
Large pre-trained transformer-based language models have achieved impressive results on a wide range of NLP tasks. In the past few years, Knowledge Distillation (KD) has become a popular paradigm to compress a computationally expensive model to a resource-efficient lightweight model. However, most KD algorithms, especially in NLP, rely on the accessibility of the original training dataset, which may be unavailable due to privacy issues. To tackle this problem, we propose a novel two-stage data-free distillation method, named Adversarial self-Supervised Data-Free Distillation (AS-DFD), which is designed for compressing large-scale transformer-based models (e.g., BERT). To avoid text generation in discrete space, we introduce a Plug & Play Embedding Guessing method to craft pseudo embeddings from the teacher's hidden knowledge. Meanwhile, with a self-supervised module to quantify the student's ability, we adapt the difficulty of pseudo embeddings in an adversarial training manner. To the best of our knowledge, our framework is the first data-free distillation framework designed for NLP tasks. We verify the effectiveness of our method on several text classification datasets.

41. Language Model Prior for Low-Resource Neural Machine Translation [PDF] 返回目录
  EMNLP 2020. Long Paper
  Christos Baziotis, Barry Haddow, Alexandra Birch
The scarcity of large parallel corpora is an important obstacle for neural machine translation. A common solution is to exploit the knowledge of language models (LM) trained on abundant monolingual data. In this work, we propose a novel approach to incorporate a LM as prior in a neural translation model (TM). Specifically, we add a regularization term, which pushes the output distributions of the TM to be probable under the LM prior, while avoiding wrong predictions when the TM "disagrees" with the LM. This objective relates to knowledge distillation, where the LM can be viewed as teaching the TM about the target language. The proposed approach does not compromise decoding speed, because the LM is used only at training time, unlike previous work that requires it during inference. We present an analysis of the effects that different methods have on the distributions of the TM. Results on two low-resource machine translation datasets show clear improvements even with limited monolingual data.
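The regularization term described here can be read as a KL-style penalty between the TM's output distribution and the frozen LM prior, added to the usual translation loss. The sketch below only illustrates that idea; the weighting, temperature, and exact divergence are assumptions rather than the paper's formulation.

```python
# Sketch of a translation loss augmented with a language-model prior term.
# The exact divergence, temperature, and weight used in the paper may differ; this only
# illustrates regularizing TM output distributions toward a frozen LM at training time.
import torch
import torch.nn.functional as F

def tm_with_lm_prior_loss(tm_logits, lm_logits, targets, lam=0.5, T=2.0):
    nll = F.cross_entropy(tm_logits.view(-1, tm_logits.size(-1)), targets.view(-1))
    prior = F.kl_div(
        F.log_softmax(tm_logits / T, dim=-1),
        F.softmax(lm_logits.detach() / T, dim=-1),   # the LM is frozen; it only shapes targets
        reduction="batchmean",
    )
    return nll + lam * prior
```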

42. Lifelong Language Knowledge Distillation [PDF] 返回目录
  EMNLP 2020. Long Paper
  Yung-Sung Chuang, Shang-Yu Su, Yun-Nung Chen
It is challenging to perform lifelong language learning (LLL) on a stream of different tasks without any performance degradation compared to the multi-task counterparts. To address this issue, we present Lifelong Language Knowledge Distillation (L2KD), a simple but efficient method that can be easily applied to existing LLL architectures in order to mitigate the degradation. Specifically, when the LLL model is trained on a new task, we assign a teacher model to first learn the new task, and pass the knowledge to the LLL model via knowledge distillation. Therefore, the LLL model can better adapt to the new task while keeping the previously learned knowledge. Experiments show that the proposed L2KD consistently improves previous state-of-the-art models, and the degradation compared to multi-task models in LLL tasks is well mitigated for both sequence generation and text classification tasks.

43. BERT-of-Theseus: Compressing BERT by Progressive Module Replacing [PDF] 返回目录
  EMNLP 2020. Long Paper
  Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, Ming Zhou
In this paper, we propose a novel model compression approach to effectively compress BERT by progressive module replacing. Our approach first divides the original BERT into several modules and builds their compact substitutes. Then, we randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules. We progressively increase the probability of replacement through the training. In this way, our approach brings a deeper level of interaction between the original and compact models. Compared to the previous knowledge distillation approaches for BERT compression, our approach does not introduce any additional loss function. Our approach outperforms existing knowledge distillation approaches on GLUE benchmark, showing a new perspective of model compression.
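The replacement scheme is easy to picture as a per-module Bernoulli draw whose success probability grows over training. The sketch below is only an illustration of that curriculum; the module lists, the linear schedule, and keeping only the compact modules at evaluation time are assumptions.

```python
# Sketch of progressive module replacing: during each training forward pass, every
# original module is independently swapped for its compact substitute with probability p,
# and p is increased over the course of training. At evaluation time only the compact
# substitutes are used. Module lists and the linear schedule are illustrative assumptions.
import torch
import torch.nn as nn

class TheseusEncoder(nn.Module):
    def __init__(self, original_modules, compact_modules):
        super().__init__()
        self.original = nn.ModuleList(original_modules)   # predecessor (typically frozen)
        self.compact = nn.ModuleList(compact_modules)     # successor being trained
        self.p = 0.0                                      # replacement probability

    def set_replacement_prob(self, step, total_steps):
        self.p = min(1.0, step / total_steps)             # e.g. a linear curriculum

    def forward(self, x):
        for orig, comp in zip(self.original, self.compact):
            if self.training:
                x = comp(x) if torch.rand(()) < self.p else orig(x)
            else:
                x = comp(x)                               # compressed model at inference
        return x
```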

44. Bridging the Gap between Prior and Posterior Knowledge Selection for Knowledge-Grounded Dialogue Generation [PDF] 返回目录
  EMNLP 2020. Long Paper
  Xiuyi Chen, Fandong Meng, Peng Li, Feilong Chen, Shuang Xu, Bo Xu, Jie Zhou
Knowledge selection plays an important role in knowledge-grounded dialogue, a challenging task that aims to generate more informative responses by leveraging external knowledge. Recently, latent variable models have been proposed to deal with the diversity of knowledge selection by using both prior and posterior distributions over knowledge, and they achieve promising performance. However, these models suffer from a huge gap between prior and posterior knowledge selection. Firstly, the prior selection module may not learn to select knowledge properly because it lacks the necessary posterior information. Secondly, latent variable models suffer from exposure bias: dialogue generation is based on knowledge selected from the posterior distribution at training time but from the prior distribution at inference time. Here, we address these issues from two aspects: (1) we enhance the prior selection module with the necessary posterior information obtained from the specially designed Posterior Information Prediction Module (PIPM); (2) we propose a Knowledge Distillation Based Training Strategy (KDBTS) to train the decoder with the knowledge selected from the prior distribution, removing the exposure bias of knowledge selection. Experimental results on two knowledge-grounded dialogue datasets show that both PIPM and KDBTS improve over the state-of-the-art latent variable model, and their combination shows further improvement.

45. Autoregressive Knowledge Distillation through Imitation Learning [PDF] 返回目录
  EMNLP 2020. Long Paper
  Alexander Lin, Jeremy Wohlwend, Howard Chen, Tao Lei
The performance of autoregressive models on natural language generation tasks has dramatically improved due to the adoption of deep, self-attentive architectures. However, these gains have come at the cost of hindering inference speed, making state-of-the-art models cumbersome to deploy in real-world, time-sensitive settings. We develop a compression technique for autoregressive models that is driven by an imitation learning perspective on knowledge distillation. The algorithm is designed to address the exposure bias problem. On prototypical language generation tasks such as translation and summarization, our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation. Student models trained with our method attain 1.4 to 4.8 BLEU/ROUGE points higher than those trained from scratch, while increasing inference speed by up to 14 times in comparison to the teacher model.

46. TernaryBERT: Distillation-aware Ultra-low Bit BERT [PDF] 返回目录
  EMNLP 2020. Long Paper
  Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, Qun Liu
Transformer-based pre-training models like BERT have achieved remarkable performance in many natural language processing tasks. However, these models are both computation and memory expensive, hindering their deployment to resource-constrained devices. In this work, we propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model. Specifically, we use both approximation-based and loss-aware ternarization methods and empirically investigate the ternarization granularity of different parts of BERT. Moreover, to reduce the accuracy degradation caused by lower capacity of low bits, we leverage the knowledge distillation technique in the training process. Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods, and even achieves comparable performance as the full-precision model while being 14.9x smaller.
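Approximation-based ternarization, in the style of ternary weight networks, picks a threshold from the mean absolute weight and a per-tensor scale from the surviving weights. The sketch below shows only that quantization step; TernaryBERT's exact granularity, its loss-aware variant, the distillation loss, and the straight-through gradient trick of the full training loop are not reproduced here.

```python
# Sketch of approximation-based ternarization of a weight tensor: weights become
# alpha * {-1, 0, +1}. The 0.7 threshold factor is the common heuristic from ternary
# weight networks; TernaryBERT's per-part granularity and loss-aware variant differ.
import torch

def ternarize(w, factor=0.7):
    delta = factor * w.abs().mean()                                # threshold
    mask = (w.abs() > delta).float()                               # weights that stay non-zero
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)     # per-tensor scale
    return alpha * torch.sign(w) * mask

w = torch.randn(768, 768)
w_ternary = ternarize(w)   # used in the forward pass; the full-precision w would keep
                           # receiving gradients via a straight-through estimator
```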

47. Improving Neural Topic Models Using Knowledge Distillation [PDF] 返回目录
  EMNLP 2020. Long Paper
  Alexander Miserlis Hoyle, Pranav Goel, Philip Resnik
Topic models are often used to identify human-interpretable topics to help make sense of large document collections. We use knowledge distillation to combine the best attributes of probabilistic topic models and pretrained transformers. Our modular method can be straightforwardly applied with any neural topic model to improve topic quality, which we demonstrate using two models having disparate architectures, obtaining state-of-the-art topic coherence. We show that our adaptable framework not only improves performance in the aggregate over all estimated topics, as is commonly reported, but also in head-to-head comparisons of aligned topics.

48. FedED: Federated Learning via Ensemble Distillation for Medical Relation Extraction [PDF] 返回目录
  EMNLP 2020. Long Paper
  Dianbo Sui, Yubo Chen, Jun Zhao, Yantao Jia, Yuantao Xie, Weijian Sun
Unlike other domains, medical texts are inevitably accompanied by private information, so sharing or copying these texts is strictly restricted. However, training a medical relation extraction model requires collecting these privacy-sensitive texts and storing them on one machine, which comes in conflict with privacy protection. In this paper, we propose a privacy-preserving medical relation extraction model based on federated learning, which enables training a central model with no single piece of private local data being shared or exchanged. Though federated learning has distinct advantages in privacy protection, it suffers from the communication bottleneck, which is mainly caused by the need to upload cumbersome local parameters. To overcome this bottleneck, we leverage a strategy based on knowledge distillation. Such a strategy uses the uploaded predictions of ensemble local models to train the central model without requiring uploading local parameters. Experiments on three publicly available medical relation extraction datasets demonstrate the effectiveness of our method.

49. Scalable Zero-shot Entity Linking with Dense Entity Retrieval [PDF] 返回目录
  EMNLP 2020. Long Paper
  Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, Luke Zettlemoyer
This paper introduces a conceptually simple, scalable, and highly effective BERT-based entity linking model, along with an extensive evaluation of its accuracy-speed trade-off. We present a two-stage zero-shot linking algorithm, where each entity is defined only by a short textual description. The first stage does retrieval in a dense space defined by a bi-encoder that independently embeds the mention context and the entity descriptions. Each candidate is then re-ranked with a cross-encoder, that concatenates the mention and entity text. Experiments demonstrate that this approach is state of the art on recent zero-shot benchmarks (6 point absolute gains) and also on more established non-zero-shot evaluations (e.g. TACKBP-2010), despite its relative simplicity (e.g. no explicit entity embeddings or manually engineered mention tables). We also show that bi-encoder linking is very fast with nearest neighbor search (e.g. linking with 5.9 million candidates in 2 milliseconds), and that much of the accuracy gain from the more expensive cross-encoder can be transferred to the bi-encoder via knowledge distillation. Our code and models are available at https://github.com/facebookresearch/BLINK.

50. Contrastive Distillation on Intermediate Representations for Language Model Compression [PDF] 返回目录
  EMNLP 2020. Long Paper
  Siqi Sun, Zhe Gan, Yuwei Fang, Yu Cheng, Shuohang Wang, Jingjing Liu
Existing language model compression methods mostly use a simple L_2 loss to distill knowledge in the intermediate representations of a large BERT model to a smaller one. Although widely used, this objective by design assumes that all the dimensions of hidden representations are independent, failing to capture important structural knowledge in the intermediate layers of the teacher network. To achieve better distillation efficacy, we propose Contrastive Distillation on Intermediate Representations (CoDIR), a principled knowledge distillation framework where the student is trained to distill knowledge through intermediate layers of the teacher via a contrastive objective. By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of rich information in the teacher's hidden layers. CoDIR can be readily applied to compress large-scale language models in both the pre-training and fine-tuning stages, and achieves superb performance on the GLUE benchmark, outperforming state-of-the-art compression methods.
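A contrastive (InfoNCE-style) objective over intermediate representations can be sketched as below. The projection to pooled vectors, the temperature, and the use of in-batch negatives are illustrative assumptions, not CoDIR's exact training recipe.

```python
# Sketch of a contrastive distillation loss on intermediate representations: the student's
# hidden state should score highest against its own teacher hidden state, relative to the
# other (negative) teacher states in the batch. Pooling, temperature, and in-batch negatives
# are illustrative choices rather than the paper's exact setup.
import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_hidden, teacher_hidden, tau=0.07):
    # student_hidden, teacher_hidden: (batch, dim) pooled intermediate features
    s = F.normalize(student_hidden, dim=-1)
    t = F.normalize(teacher_hidden, dim=-1)
    logits = s @ t.t() / tau                           # similarity of every student/teacher pair
    labels = torch.arange(s.size(0), device=s.device)  # the matched pair is the positive
    return F.cross_entropy(logits, labels)
```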

51. Distilling Structured Knowledge for Text-Based Relational Reasoning [PDF] 返回目录
  EMNLP 2020. Short Paper
  Jin Dong, Marc-Antoine Rondeau, William L. Hamilton
There is an increasing interest in developing text-based relational reasoning systems, which are capable of systematically reasoning about the relationships between entities mentioned in a text. However, there remains a substantial performance gap between NLP models for relational reasoning and models based on graph neural networks (GNNs), which have access to an underlying symbolic representation of the text. In this work, we investigate how the structured knowledge of a GNN can be distilled into various NLP models in order to improve their performance. We first pre-train a GNN on a reasoning task using structured inputs and then incorporate its knowledge into an NLP model (e.g., an LSTM) via knowledge distillation. To overcome the difficulty of cross-modal knowledge transfer, we also employ a contrastive learning based module to align the latent representations of NLP models and the GNN. We test our approach with two state-of-the-art NLP models on 13 different inductive reasoning datasets from the CLUTRR benchmark and obtain significant improvements.

52. Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers [PDF] 返回目录
  EMNLP 2020. Short Paper
  Yimeng Wu, Peyman Passban, Mehdi Rezagholizadeh, Qun Liu
With the growth of computing power, neural machine translation (NMT) models have also grown accordingly and become better. However, they have also become harder to deploy on edge devices due to memory constraints. To cope with this problem, a common practice is to distill knowledge from a large and accurately-trained teacher network (T) into a compact student network (S). Although knowledge distillation (KD) is useful in most cases, our study shows that existing KD techniques might not be suitable enough for deep NMT engines, so we propose a novel alternative. In our model, besides matching T and S predictions, we have a combinatorial mechanism to inject layer-level supervision from T to S. In this paper, we target low-resource settings and evaluate our translation engines for the Portuguese→English, Turkish→English, and English→German directions. Students trained using our technique have 50% fewer parameters and can still deliver comparable results to those of 12-layer teachers.

53. Mimic and Conquer: Heterogeneous Tree Structure Distillation for Syntactic NLP [PDF] 返回目录
  EMNLP 2020. Findings Short Paper
  Hao Fei, Yafeng Ren, Donghong Ji
Syntax has been shown to be useful for various NLP tasks, while existing work mostly encodes a single syntactic tree using one hierarchical neural network. In this paper, we investigate a simple and effective method, Knowledge Distillation, to integrate heterogeneous structure knowledge into a unified sequential LSTM encoder. Experimental results on four typical syntax-dependent tasks show that our method outperforms tree encoders by effectively integrating rich heterogeneous structural syntax while reducing error propagation, and also outperforms ensemble methods in terms of both efficiency and accuracy.

54. Using the Past Knowledge to Improve Sentiment Classification [PDF] 返回目录
  EMNLP 2020. Findings Short Paper
  Qi Qin, Wenpeng Hu, Bing Liu
This paper studies sentiment classification in the lifelong learning setting that incrementally learns a sequence of sentiment classification tasks. It proposes a new lifelong learning model (called L2PG) that can retain and selectively transfer the knowledge learned in the past to help learn the new task. A key innovation of this proposed model is a novel parameter-gate (p-gate) mechanism that regulates the flow or transfer of the previously learned knowledge to the new task. Specifically, it can selectively use the network parameters (which represent the retained knowledge gained from the previous tasks) to assist the learning of the new task t. Knowledge distillation is also employed in the process to preserve the past knowledge by approximating the network output at the state when task t-1 was learned. Experimental results show that L2PG outperforms strong baselines, including even multi-task learning.

55. Fast End-to-end Coreference Resolution for Korean [PDF] 返回目录
  EMNLP 2020. Findings Short Paper
  Cheoneum Park, Jamin Shin, Sungjoon Park, Joonho Lim, Changki Lee
Recently, end-to-end neural network-based approaches have shown significant improvements over traditional pipeline-based models in English coreference resolution. However, such advancements came at the cost of computational complexity, and recent works have not focused on tackling this problem. Hence, in this paper, to cope with this issue, we propose BERT-SRU-based Pointer Networks that leverage the linguistic properties of head-final languages. Applying this model to Korean coreference resolution, we significantly reduce the coreference linking search space. Combining this with Ensemble Knowledge Distillation, we maintain state-of-the-art performance of 66.9% CoNLL F1 on the ETRI test set while achieving a 2x speedup (30 doc/sec) in document processing time.

56. Improving Word Embedding Factorization for Compression using Distilled Nonlinear Neural Decomposition [PDF] 返回目录
  EMNLP 2020. Findings Short Paper
  Vasileios Lioutas, Ahmad Rashid, Krtin Kumar, Md. Akmal Haidar, Mehdi Rezagholizadeh
Word embeddings are vital components of Natural Language Processing (NLP) models and have been extensively explored. However, they consume a lot of memory, which poses a challenge for edge deployment. Embedding matrices typically contain most of the parameters of language models and about a third of those of machine translation systems. In this paper, we propose Distilled Embedding, an (input/output) embedding compression method based on low-rank matrix decomposition and knowledge distillation. First, we initialize the weights of our decomposed matrices by learning to reconstruct the full pre-trained word embedding, and then fine-tune end-to-end, employing knowledge distillation on the factorized embedding. We conduct extensive experiments with various compression rates on machine translation and language modeling, using different datasets with a shared word-embedding matrix for both the embedding and vocabulary projection matrices. We show that the proposed technique is simple to replicate, with one fixed parameter controlling compression size, achieves higher BLEU scores on translation and lower perplexity on language modeling compared to complex, difficult-to-tune state-of-the-art methods.
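The factorization step amounts to replacing the |V| x d embedding matrix with the product of a |V| x r and an r x d matrix, initialized to reconstruct the pre-trained embedding before end-to-end fine-tuning. The sketch below uses an SVD initialization and omits the nonlinearity and the distillation loss; both the class name and that choice of initialization are assumptions.

```python
# Sketch of low-rank (input/output) embedding factorization: E (|V| x d) is replaced by
# A (|V| x r) times B (r x d), initialized to approximately reconstruct the pre-trained
# embedding and then fine-tuned end-to-end (with knowledge distillation in the paper).
# The SVD-based initialization and class name are illustrative assumptions.
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    def __init__(self, pretrained_weight, rank):
        super().__init__()
        U, S, Vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        self.A = nn.Parameter(U[:, :rank] * S[:rank])   # |V| x r
        self.B = nn.Parameter(Vh[:rank, :])             # r x d
        # the paper's "nonlinear neural decomposition" would insert a nonlinearity here

    def forward(self, token_ids):
        return self.A[token_ids] @ self.B               # reconstructed embedding rows

emb = FactorizedEmbedding(torch.randn(32000, 512), rank=64)
vectors = emb(torch.tensor([[1, 5, 42]]))               # (1, 3, 512)
```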

57. DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling [PDF] 返回目录
  EMNLP 2020. Findings Short Paper
  Jiecao Chen, Liu Yang, Karthik Raman, Michael Bendersky, Jung-Jung Yeh, Yun Zhou, Marc Najork, Danyang Cai, Ehsan Emadzadeh
Pre-trained models like BERT (Devlin et al., 2018) have dominated NLP / IR applications such as single sentence classification, text pair classification, and question answering. However, deploying these models in real systems is highly non-trivial due to their exorbitant computational costs. A common remedy to this is knowledge distillation (Hinton et al., 2015), leading to faster inference. However – as we show here – existing works are not optimized for dealing with pairs (or tuples) of texts. Consequently, they are either not scalable or demonstrate subpar performance. In this work, we propose DiPair — a novel framework for distilling fast and accurate models on text pair tasks. Coupled with an end-to-end training strategy, DiPair is both highly scalable and offers improved quality-speed tradeoffs. Empirical studies conducted on both academic and real-world e-commerce benchmarks demonstrate the efficacy of the proposed approach with speedups of over 350x and minimal quality drop relative to the cross-attention teacher BERT model.

58. General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference [PDF] 返回目录
  EMNLP 2020. Findings Short Paper
  Jingfei Du, Myle Ott, Haoran Li, Xing Zhou, Veselin Stoyanov
The state of the art on many NLP tasks is currently achieved by large pre-trained language models, which require a considerable amount of computation. We aim to reduce the inference cost in a setting where many different predictions are made on a single piece of text. In that case, computational cost during inference can be amortized over the different predictions (tasks) using a shared text encoder. We compare approaches for training such an encoder and show that encoders pre-trained over multiple tasks generalize well to unseen tasks. We also compare ways of extracting fixed- and limited-size representations from this encoder, including pooling features extracted from multiple layers or positions. Our best approach compares favorably to knowledge distillation, achieving higher accuracy and lower computational cost once the system is handling around 7 tasks. Further, we show that through binary quantization, we can reduce the size of the extracted representations by a factor of 16 to store them for later use. The resulting method offers a compelling solution for using large-scale pre-trained models at a fraction of the computational cost when multiple tasks are performed on the same text.

59. Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning [PDF] 返回目录
  EMNLP 2020. Findings Short Paper
  Bingbing Li, Zhenglun Kong, Tianyun Zhang, Ji Li, Zhengang Li, Hang Liu, Caiwen Ding
Pretrained large-scale language models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks. However, limited weight storage and computational speed on hardware platforms have impeded the popularity of pretrained models, especially in the era of edge computing. In this work, we propose an efficient transformer-based large-scale language representation using hardware-friendly block-structured pruning. We incorporate the reweighted group Lasso into block-structured pruning for optimization. Besides the significantly reduced weight storage and computation, the proposed approach achieves high compression rates. Experimental results on different models (BERT, RoBERTa, and DistilBERT) on the General Language Understanding Evaluation (GLUE) benchmark tasks show that we achieve up to a 5.0x compression rate with zero or minor accuracy degradation on certain task(s). Our proposed method is also orthogonal to existing compact pretrained language models such as DistilBERT that use knowledge distillation, since a further 1.79x average compression rate can be achieved on top of DistilBERT with zero or minor accuracy degradation. It is suitable for deploying the final compressed model on resource-constrained edge devices.

60. TinyBERT: Distilling BERT for Natural Language Understanding [PDF] 返回目录
  EMNLP 2020. Findings Short Paper
  Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT4 with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT4 is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only ~28% of their parameters and ~31% of their inference time. Moreover, TinyBERT6 with 6 layers performs on par with its teacher BERT-Base.
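Layer-to-layer Transformer distillation of this kind is commonly written as MSE terms on attention matrices and on (linearly projected) hidden states, on top of the usual soft-label loss. The sketch below is a simplified per-layer version; the uniform layer mapping and the single shared projection are assumptions.

```python
# Simplified sketch of layer-to-layer Transformer distillation in the spirit of TinyBERT:
# MSE on attention matrices plus MSE on hidden states, with the student's hidden states
# linearly projected to the teacher's width. The uniform layer mapping and a single
# shared projection are illustrative assumptions; the soft-label loss is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

def transformer_distill_loss(stu_attn, stu_hidden, tea_attn, tea_hidden, proj):
    """stu_/tea_attn: lists of (batch, heads, seq, seq); *_hidden: lists of (batch, seq, dim)."""
    k = len(tea_attn) // len(stu_attn)      # map student layer i to teacher layer (i+1)*k - 1
    loss = 0.0
    for i, (sa, sh) in enumerate(zip(stu_attn, stu_hidden)):
        j = (i + 1) * k - 1
        loss = loss + F.mse_loss(sa, tea_attn[j])
        loss = loss + F.mse_loss(proj(sh), tea_hidden[j])   # proj: nn.Linear(stu_dim, tea_dim)
    return loss
```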

61. Knowledge Consistency between Neural Networks and Beyond [PDF] 返回目录
  ICLR 2020.
  Ruofan Liang, Tianlin Li, Longfei Li, Jing Wang, Quanshi Zhang
This paper aims to analyze knowledge consistency between pre-trained deep neural networks. We propose a generic definition for knowledge consistency between neural networks at different fuzziness levels. A task-agnostic method is designed to disentangle feature components, which represent the consistent knowledge, from raw intermediate-layer features of each neural network. As a generic tool, our method can be broadly used for different applications. In preliminary experiments, we have used knowledge consistency as a tool to diagnose representations of neural networks. Knowledge consistency provides new insights to explain the success of existing deep-learning techniques, such as knowledge distillation and network compression. More crucially, knowledge consistency can also be used to refine pre-trained networks and boost performance.

62. Understanding Knowledge Distillation in Non-autoregressive Machine Translation [PDF] 返回目录
  ICLR 2020.
  Chunting Zhou, Jiatao Gu, Graham Neubig
Non-autoregressive machine translation (NAT) systems predict a sequence of output tokens in parallel, achieving substantial improvements in generation speed compared to autoregressive models. Existing NAT models usually rely on the technique of knowledge distillation, which creates the training data from a pretrained autoregressive model for better performance. Knowledge distillation is empirically useful, leading to large gains in accuracy for NAT models, but the reason for this success has, as of yet, been unclear. In this paper, we first design systematic experiments to investigate why knowledge distillation is crucial to NAT training. We find that knowledge distillation can reduce the complexity of data sets and help NAT to model the variations in the output data. Furthermore, a strong correlation is observed between the capacity of an NAT model and the optimal complexity of the distilled data for the best translation quality. Based on these findings, we further propose several approaches that can alter the complexity of data sets to improve the performance of NAT models. We achieve the state-of-the-art performance for the NAT-based models, and close the gap with the autoregressive baseline on WMT14 En-De benchmark.

63. Contrastive Representation Distillation [PDF] 返回目录
  ICLR 2020.
  Yonglong Tian, Dilip Krishnan, Phillip Isola
Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a collection of models into a single estimator. Knowledge distillation, the standard approach to these problems, minimizes the KL divergence between the probabilistic outputs of a teacher and student network. We demonstrate that this objective ignores important structural knowledge of the teacher network. This motivates an alternative objective by which we train a student to capture significantly more information in the teacher's representation of the data. We formulate this objective as contrastive learning. Experiments demonstrate that our resulting new objective outperforms knowledge distillation on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer. When combined with knowledge distillation, our method sets a state of the art in many transfer tasks, sometimes even outperforming the teacher network.

64. Neural Epitome Search for Architecture-Agnostic Network Compression [PDF] 返回目录
  ICLR 2020.
  Daquan Zhou, Xiaojie Jin, Qibin Hou, Kaixin Wang, Jianchao Yang, Jiashi Feng
Traditional compression methods including network pruning, quantization, low-rank factorization and knowledge distillation all assume that network architectures and parameters should be hardwired. In this work, we propose a new perspective on network compression, i.e., that network parameters can be disentangled from the architectures. From this viewpoint, we present Neural Epitome Search (NES), a new neural network compression approach that learns to find compact yet expressive epitomes for the weight parameters of a specified network architecture end-to-end. The complete network to compress can be generated from the learned epitome via a novel transformation method that adaptively transforms the epitomes to match the shapes of the given architecture. Compared with existing compression methods, NES allows the weight tensors to be independent of the architecture design and hence can achieve a good trade-off between model compression rate and performance given a specific model size constraint. Experiments demonstrate that, on ImageNet, when taking MobileNetV2 as the backbone, our approach improves the full-model baseline by 1.47% in top-1 accuracy with a 25% MAdds reduction, and AutoML for Model Compression (AMC) by 2.5% with nearly the same compression ratio. Moreover, taking EfficientNet-B0 as the baseline, our NES yields an improvement of 1.2% with 10% fewer MAdds. In particular, our method achieves a new state-of-the-art result of 77.5% under mobile settings (<350M MAdds). Code will be made publicly available.

65. Lifelong Zero-Shot Learning [PDF] 返回目录
  IJCAI 2020.
  Kun Wei, Cheng Deng, Xu Yang
Zero-Shot Learning (ZSL) handles the problem that some testing classes never appear in the training set. Existing ZSL methods are designed for learning from a fixed training set and do not have the ability to capture and accumulate the knowledge of multiple training sets, making them infeasible for many real-world applications. In this paper, we propose a new ZSL setting, named Lifelong Zero-Shot Learning (LZSL), which aims to accumulate knowledge while learning from multiple datasets and to recognize unseen classes of all trained datasets. Besides, a novel method is proposed to realize LZSL, which effectively alleviates catastrophic forgetting in the continuous training process. Specifically, considering that these datasets contain different semantic embeddings, we utilize a Variational Auto-Encoder to obtain unified semantic representations. Then, we leverage a selective retraining strategy to preserve the trained weights of previous tasks and avoid negative transfer when fine-tuning the entire model. Finally, knowledge distillation is employed to transfer knowledge from previous training stages to the current stage. We also design the LZSL evaluation protocol and the challenging benchmarks. Extensive experiments on these benchmarks indicate that our method tackles the LZSL problem effectively, while existing ZSL methods fail.

66. AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search [PDF] 返回目录
  IJCAI 2020.
  Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, Wei Lin, Jingren Zhou
Large pre-trained language models such as BERT have shown their effectiveness in various natural language processing tasks. However, the huge parameter size makes them difficult to be deployed in real-time applications that require quick inference with limited resources. Existing methods compress BERT into small models while such compression is task-independent, i.e., the same compressed BERT for all different downstream tasks. Motivated by the necessity and benefits of task-oriented BERT compression, we propose a novel compression method, AdaBERT, that leverages differentiable Neural Architecture Search to automatically compress BERT into task-adaptive small models for specific tasks. We incorporate a task-oriented knowledge distillation loss to provide search hints and an efficiency-aware loss as search constraints, which enables a good trade-off between efficiency and effectiveness for task-adaptive BERT compression. We evaluate AdaBERT on several NLP tasks, and the results demonstrate that those task-adaptive compressed models are 12.7x to 29.3x faster than BERT in inference time and 11.5x to 17.0x smaller in terms of parameter size, while comparable performance is maintained.

67. Dual Policy Distillation [PDF] 返回目录
  IJCAI 2020.
  Kwei-Herng Lai, Daochen Zha, Yuening Li, Xia Hu
Policy distillation, which transfers a teacher policy to a student policy, has achieved great success in challenging tasks of deep reinforcement learning. This teacher-student framework requires a well-trained teacher model which is computationally expensive. Moreover, the performance of the student model could be limited by the teacher model if the teacher model is not optimal. In light of collaborative learning, we study the feasibility of involving joint intellectual efforts from diverse perspectives of student models. In this work, we introduce dual policy distillation (DPD), a student-student framework in which two learners operate on the same environment to explore different perspectives of the environment and extract knowledge from each other to enhance their learning. The key challenge in developing this dual learning framework is to identify the beneficial knowledge from the peer learner for contemporary learning-based reinforcement learning algorithms, since it is unclear whether the knowledge distilled from an imperfect and noisy peer learner would be helpful. To address the challenge, we theoretically justify that distilling knowledge from a peer learner will lead to policy improvement and propose a disadvantageous distillation strategy based on the theoretical results. The conducted experiments on several continuous control tasks show that the proposed framework achieves superior performance with a learning-based agent and function approximation without the use of expensive teacher models.

68. P-KDGAN: Progressive Knowledge Distillation with GANs for One-class Novelty Detection [PDF] 返回目录
  IJCAI 2020.
  Zhiwei Zhang, Shifeng Chen, Lei Sun
One-class novelty detection is to identify anomalous instances that do not conform to the expected normal instances. In this paper, Generative Adversarial Networks (GANs) based on an encoder-decoder-encoder pipeline are used for detection and achieve state-of-the-art performance. However, deep neural networks are too over-parameterized to deploy on resource-limited devices. Therefore, Progressive Knowledge Distillation with GANs (P-KDGAN) is proposed to learn compact and fast novelty detection networks. P-KDGAN is a novel attempt to connect two standard GANs by the designed distillation loss for transferring knowledge from the teacher to the student. The progressive learning of knowledge distillation is a two-step approach that continuously improves the performance of the student GAN and achieves better performance than single-step methods. In the first step, the student GAN learns the basic knowledge entirely from the teacher, guided by the pre-trained teacher GAN with fixed weights. In the second step, joint fine-training is adopted for the knowledgeable teacher and student GANs to further improve the performance and stability. The experimental results on CIFAR-10, MNIST, and FMNIST show that our method improves the performance of the student GAN by 2.44%, 1.77%, and 1.73% when compressing the computation at ratios of 24.45:1, 311.11:1, and 700:1, respectively.

69. An Iterative Multi-Source Mutual Knowledge Transfer Framework for Machine Reading Comprehension [PDF] 返回目录
  IJCAI 2020.
  Xin Liu, Kai Liu, Xiang Li, Jinsong Su, Yubin Ge, Bin Wang, Jiebo Luo
The lack of sufficient training data in many domains poses a major challenge to the construction of domain-specific machine reading comprehension (MRC) models with satisfactory performance. In this paper, we propose a novel iterative multi-source mutual knowledge transfer framework for MRC. As an extension of the conventional knowledge transfer with one-to-one correspondence, our framework focuses on many-to-many mutual transfer, which involves synchronous executions of multiple many-to-one transfers in an iterative manner. Specifically, to update a target-domain MRC model, we first consider other domain-specific MRC models as individual teachers, and employ knowledge distillation to train a multi-domain MRC model, which is differentially required to fit the training data and match the outputs of these individual models according to their domain-level similarities to the target domain. After being initialized by the multi-domain MRC model, the target-domain MRC model is fine-tuned to match both its training data and the output of its previous best model simultaneously via knowledge distillation. Compared with previous approaches, our framework can continuously enhance all domain-specific MRC models by enabling each model to iteratively and differentially absorb the domain-shared knowledge from others. Experimental results and in-depth analyses on several benchmark datasets demonstrate the effectiveness of our framework.

70. UniTrans : Unifying Model Transfer and Data Transfer for Cross-Lingual Named Entity Recognition with Unlabeled Data [PDF] 返回目录
  IJCAI 2020.
  Qianhui Wu, Zijia Lin, Börje F. Karlsson, Biqing Huang, Jianguang Lou
Prior work in cross-lingual named entity recognition (NER) with no/little labeled data falls into two primary categories: model transfer- and data transfer-based methods. In this paper, we find that both method types can complement each other, in the sense that, the former can exploit context information via language-independent features but sees no task-specific information in the target language; while the latter generally generates pseudo target-language training data via translation but its exploitation of context information is weakened by inaccurate translations. Moreover, prior work rarely leverages unlabeled data in the target language, which can be effortlessly collected and potentially contains valuable information for improved results. To handle both problems, we propose a novel approach termed UniTrans to Unify both model and data Transfer for cross-lingual NER, and furthermore, leverage the available information from unlabeled target-language data via enhanced knowledge distillation. We evaluate our proposed UniTrans over 4 target languages on benchmark datasets. Our experimental results show that it substantially outperforms the existing state-of-the-art methods.

71. Private Model Compression via Knowledge Distillation [PDF] 返回目录
  AAAI 2019. AAAI Technical Track: Applications
  Ji Wang, Weidong Bao, Lichao Sun, Xiaomin Zhu, Bokai Cao, Philip S. Yu
The soaring demand for intelligent mobile applications calls for deploying powerful deep neural networks (DNNs) on mobile devices. However, the outstanding performance of DNNs notoriously relies on increasingly complex models, which in turn is associated with an increase in computational expense far surpassing mobile devices’ capacity. What is worse, app service providers need to collect and utilize a large volume of users’ data, which contain sensitive information, to build the sophisticated DNN models. Directly deploying these models on public mobile devices presents a prohibitive privacy risk. To benefit from on-device deep learning without the capacity and privacy concerns, we design a private model compression framework RONA. Following the knowledge distillation paradigm, we jointly use hint learning, distillation learning, and self learning to train a compact and fast neural network. The knowledge distilled from the cumbersome model is adaptively bounded and carefully perturbed to enforce differential privacy. We further propose an elegant query sample selection method to reduce the number of queries and control the privacy loss. A series of empirical evaluations as well as the implementation on an Android mobile device show that RONA can not only compress cumbersome models efficiently but also provide a strong privacy guarantee. For example, on SVHN, when a meaningful (9.83, 10^-6)-differential privacy guarantee is enforced, the compact model trained by RONA can obtain a 20× compression ratio and a 19× speed-up with merely 0.97% accuracy loss.

72. Knowledge Distillation with Adversarial Samples Supporting Decision Boundary [PDF] 返回目录
  AAAI 2019. AAAI Technical Track: Machine Learning
  Byeongho Heo, Minsik Lee, Sangdoo Yun, Jin Young Choi
Many recent works on knowledge distillation have provided ways to transfer the knowledge of a trained network to improve the learning process of a new one, but finding a good technique for knowledge distillation is still an open problem. In this paper, we provide a new perspective based on the decision boundary, which is one of the most important components of a classifier. The generalization performance of a classifier is closely related to the adequacy of its decision boundary, so a good classifier has a good decision boundary. Therefore, transferring information closely related to the decision boundary can be a good attempt at knowledge distillation. To realize this goal, we utilize an adversarial attack to discover samples supporting the decision boundary. Based on this idea, to transfer more accurate information about the decision boundary, the proposed algorithm trains a student classifier on the adversarial samples supporting the decision boundary. Experiments show that the proposed method indeed improves knowledge distillation and achieves state-of-the-art performance.

73. Data-Distortion Guided Self-Distillation for Deep Neural Networks [PDF] 返回目录
  AAAI 2019. AAAI Technical Track: Machine Learning
  Ting-Bing Xu, Cheng-Lin Liu
Knowledge distillation is an effective technique that has been widely used for transferring knowledge from one network to another. Despite its effective improvement of network performance, the dependence on accompanying assistive models complicates the training process of a single network, requiring large memory and time costs. In this paper, we design a more elegant self-distillation mechanism to transfer knowledge between different distorted versions of the same training data, without relying on accompanying models. Specifically, the potential capacity of a single network is exploited by learning consistent global feature distributions and posterior distributions (class probabilities) across these distorted versions of the data. Extensive experiments on multiple datasets (i.e., CIFAR-10/100 and ImageNet) demonstrate that the proposed method can effectively improve the generalization performance of various network architectures (such as AlexNet, ResNet, Wide ResNet, and DenseNet), outperforming existing distillation methods with little extra training effort.
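The posterior-consistency part of this idea can be sketched as a symmetric KL term between the network's predictions on two distorted (augmented) views of the same batch, added to the ordinary supervised loss. The symmetric form and the weight below are assumptions, and the paper's additional alignment of global feature distributions is not shown.

```python
# Sketch of self-distillation across two distorted views of the same batch: the class
# probabilities predicted for the two views are pulled together with a symmetric KL term
# on top of the supervised loss. The symmetric form and weight `beta` are assumptions;
# the paper's global feature-distribution alignment is omitted.
import torch
import torch.nn.functional as F

def data_distortion_self_distill_loss(model, view_a, view_b, labels, beta=1.0):
    logits_a, logits_b = model(view_a), model(view_b)
    ce = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_b, labels)
    kl_ab = F.kl_div(F.log_softmax(logits_a, dim=-1),
                     F.softmax(logits_b, dim=-1), reduction="batchmean")
    kl_ba = F.kl_div(F.log_softmax(logits_b, dim=-1),
                     F.softmax(logits_a, dim=-1), reduction="batchmean")
    return ce + beta * (kl_ab + kl_ba)
```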

74. Exploiting the Ground-Truth: An Adversarial Imitation Based Knowledge Distillation Approach for Event Detection [PDF] 返回目录
  AAAI 2019. AAAI Technical Track: Natural Language Processing
  Jian Liu, Yubo Chen, Kang Liu
The ambiguity of language expressions poses a great challenge for event detection. To disambiguate event types, current approaches rely on external NLP toolkits to build knowledge representations. Unfortunately, these approaches work in a pipeline paradigm and suffer from the error propagation problem. In this paper, we propose an adversarial imitation based knowledge distillation approach, for the first time, to tackle the challenge of acquiring knowledge from raw sentences for event detection. In our approach, a teacher module is first devised to learn the knowledge representations from the ground-truth annotations. Then, we set up a student module that only takes the raw sentences as input. The student module is taught to imitate the behavior of the teacher under the guidance of an adversarial discriminator. In this way, the process of knowledge distillation from raw sentences has been implicitly integrated into the feature encoding stage of the student module. In the end, the enhanced student is used for event detection; it processes raw texts and requires no extra toolkits, naturally eliminating the error propagation problem faced by pipeline approaches. We conduct extensive experiments on the ACE 2005 datasets, and the experimental results justify the effectiveness of our approach.

75. Scalable Syntax-Aware Language Models Using Knowledge Distillation [PDF] 返回目录
  ACL 2019.
  Adhiguna Kuncoro, Chris Dyer, Laura Rimell, Stephen Clark, Phil Blunsom
Prior work has shown that, on small amounts of training data, syntactic neural language models learn structurally sensitive generalisations more successfully than sequential language models. However, their computational complexity renders scaling difficult, and it remains an open question whether structural biases are still necessary when sequential models have access to ever larger amounts of training data. To answer this question, we introduce an efficient knowledge distillation (KD) technique that transfers knowledge from a syntactic language model trained on a small corpus to an LSTM language model, hence enabling the LSTM to develop a more structurally sensitive representation of the larger training data it learns from. On targeted syntactic evaluations, we find that, while sequential LSTMs perform much better than previously reported, our proposed technique substantially improves on this baseline, yielding a new state of the art. Our findings and analysis affirm the importance of structural biases, even in models that learn from large amounts of data.

76. BAM! Born-Again Multi-Task Networks for Natural Language Understanding [PDF] 返回目录
  ACL 2019.
  Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D. Manning, Quoc V. Le
It can be challenging to train multi-task neural networks that outperform or even match their single-task counterparts. To help address this, we propose using knowledge distillation where single-task models teach a multi-task model. We enhance this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers. We evaluate our approach by multi-task fine-tuning BERT on the GLUE benchmark. Our method consistently improves over standard single-task and multi-task training.
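Teacher annealing can be pictured as a convex combination of the distillation target and the gold label whose mixing weight moves from teacher to gold over training. The linear schedule in the sketch below is an assumption, and the multi-task aspect (one teacher per task) is not shown.

```python
# Sketch of teacher annealing: early in training the student mostly matches the single-task
# teacher's predictions, and by the end it trains purely on gold labels. The linear schedule
# for lambda is an illustrative assumption; the multi-task wiring is omitted.
import torch
import torch.nn.functional as F

def annealed_distill_loss(student_logits, teacher_probs, gold_labels, step, total_steps):
    lam = min(1.0, step / total_steps)   # 0 -> pure distillation, 1 -> pure supervised
    log_probs = F.log_softmax(student_logits, dim=-1)
    distill = -(teacher_probs * log_probs).sum(dim=-1).mean()   # cross-entropy vs. teacher
    supervised = F.cross_entropy(student_logits, gold_labels)
    return (1 - lam) * distill + lam * supervised
```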

77. PANLP at MEDIQA 2019: Pre-trained Language Models, Transfer Learning and Knowledge Distillation [PDF] 返回目录
  ACL 2019. the 18th BioNLP Workshop and Shared Task
  Wei Zhu, Xiaofeng Zhou, Keqiang Wang, Xun Luo, Xiepeng Li, Yuan Ni, Guotong Xie
This paper describes the models designated for the MEDIQA 2019 shared tasks by the team PANLP. We take advantage of the recent advances in pre-trained bidirectional transformer language models such as BERT (Devlin et al., 2018) and MT-DNN (Liu et al., 2019b). We find that pre-trained language models can significantly outperform traditional deep learning models. Transfer learning from the NLI task to the RQE task is also experimented with, which proves useful in improving the results of fine-tuning MT-DNN large. A knowledge distillation process is implemented to distill the knowledge contained in a set of models and transfer it into a single model, whose performance turns out to be comparable with that obtained by the ensemble of that set of models. Finally, for test submissions, model ensembling and a re-ranking process are implemented to boost performance. Our models participated in all three tasks and ranked 1st for the RQE task, 2nd for the NLI task, and 2nd for the QA task.

78. GTCOM Neural Machine Translation Systems for WMT19 [PDF] 返回目录
  ACL 2019. the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
  Chao Bei, Hao Zong, Conghu Yuan, Qingming Liu, Baoyong Fan
This paper describes the Global Tone Communication Co., Ltd.’s submission to the WMT19 shared news translation task. We participate in six directions: English to (Gujarati, Lithuanian and Finnish) and (Gujarati, Lithuanian and Finnish) to English. We obtain the best BLEU scores among all participants in the English-to-Gujarati and Lithuanian-to-English directions (28.2 and 36.3 respectively). The submitted systems mainly rely on back-translation, knowledge distillation and reranking to build a competitive model for this task. We also apply language models to filter monolingual data, back-translated data and parallel data; the data filtering techniques include rule-based filtering and language-model filtering. In addition, we conduct several experiments to validate different knowledge distillation techniques and right-to-left (R2L) reranking.

79. The NiuTrans Machine Translation Systems for WMT19 [PDF] 返回目录
  ACL 2019. the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
  Bei Li, Yinqiao Li, Chen Xu, Ye Lin, Jiqiang Liu, Hui Liu, Ziyang Wang, Yuhao Zhang, Nuo Xu, Zeyang Wang, Kai Feng, Hexuan Chen, Tengbo Liu, Yanyang Li, Qiang Wang, Tong Xiao, Jingbo Zhu
This paper describes the NiuTrans neural machine translation systems for the WMT 2019 news translation tasks. We participated in 13 translation directions, including 11 supervised tasks, namely EN↔{ZH, DE, RU, KK, LT} and GU→EN, and the unsupervised DE↔CS sub-track. Our systems were built on Deep Transformer and several back-translation methods. Iterative knowledge distillation and ensemble+reranking were also employed to obtain stronger models. Our unsupervised submissions were based on NMT enhanced by SMT. As a result, we achieved the highest BLEU scores in the {KK↔EN, GU→EN} directions, ranking 2nd in {RU→EN, DE↔CS} and 3rd in {ZH→EN, LT→EN, EN→RU, EN↔DE} among all constrained submissions.

80. Baidu Neural Machine Translation Systems for WMT19 [PDF] 返回目录
  ACL 2019. the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
  Meng Sun, Bojian Jiang, Hao Xiong, Zhongjun He, Hua Wu, Haifeng Wang
In this paper we introduce the systems Baidu submitted for the WMT19 shared task on Chinese↔English news translation. Our systems are based on the Transformer architecture with several effective improvements. Data selection, back-translation, data augmentation, knowledge distillation, domain adaptation, model ensembling and re-ranking are employed and proven effective in our experiments. Our Chinese→English system achieved the highest case-sensitive BLEU score among all constrained submissions, and our English→Chinese system ranked second among all submissions.

81. Microsoft Research Asia’s Systems for WMT19 [PDF] 返回目录
  ACL 2019. the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
  Yingce Xia, Xu Tan, Fei Tian, Fei Gao, Di He, Weicong Chen, Yang Fan, Linyuan Gong, Yichong Leng, Renqian Luo, Yiren Wang, Lijun Wu, Jinhua Zhu, Tao Qin, Tie-Yan Liu
Microsoft Research Asia made submissions to 11 language directions in the WMT19 news translation tasks. We won first place in 8 of the 11 directions and second place in the other three. Our basic systems are built on Transformer, back translation and knowledge distillation. We integrate several of our recent techniques to enhance the baseline systems: multi-agent dual learning (MADL), masked sequence-to-sequence pre-training (MASS), neural architecture optimization (NAO), and soft contextual data augmentation (SCA).

82. Iterative Dual Domain Adaptation for Neural Machine Translation [PDF] 返回目录
  EMNLP 2019.
  Jiali Zeng, Yang Liu, Jinsong Su, Yubing Ge, Yaojie Lu, Yongjing Yin, Jiebo Luo
Previous studies on domain adaptation for neural machine translation (NMT) mainly focus on one-pass transfer of out-of-domain translation knowledge to the in-domain NMT model. In this paper, we argue that such a strategy fails to fully extract the domain-shared translation knowledge, and that repeatedly utilizing corpora of different domains can lead to better distillation of domain-shared translation knowledge. To this end, we propose an iterative dual domain adaptation framework for NMT. Specifically, we first pretrain in-domain and out-of-domain NMT models on their own training corpora respectively, and then iteratively perform bidirectional translation knowledge transfer (from in-domain to out-of-domain and vice versa) based on knowledge distillation until the in-domain NMT model converges. Furthermore, we extend the proposed framework to the scenario of multiple out-of-domain training corpora, where the above-mentioned transfer is performed sequentially between the in-domain model and each out-of-domain NMT model in ascending order of their domain similarities. Empirical results on Chinese-English and English-German translation tasks demonstrate the effectiveness of our framework.

83. Patient Knowledge Distillation for BERT Model Compression [PDF] 返回目录
  EMNLP 2019.
  Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu
Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In order to alleviate this resource hunger in large-scale model training, we propose a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student). Different from previous knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model patiently learns from multiple intermediate layers of the teacher model for incremental knowledge extraction, following two strategies: (i) PKD-Last: learning from the last k layers; and (ii) PKD-Skip: learning from every k layers. These two patient distillation schemes enable the exploitation of rich information in the teacher’s hidden layers, and encourage the student model to patiently learn from and imitate the teacher through a multi-layer distillation process. Empirically, this translates into improved results on multiple NLP tasks with a significant gain in training efficiency, without sacrificing model accuracy.
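A condensed sketch of the two layer-matching strategies and the "patient" loss; the 12-layer teacher, 6-layer student, and [CLS]-position hidden states below are illustrative assumptions, and the exact layer indices used in the paper may differ:

```python
import torch.nn.functional as F

def pkd_layer_map(n_teacher=12, n_student=6, strategy="skip"):
    if strategy == "skip":                                   # PKD-Skip: every k-th teacher layer
        k = n_teacher // n_student
        return [k * (i + 1) - 1 for i in range(n_student)]
    return list(range(n_teacher - n_student, n_teacher))     # PKD-Last: the last k layers

def patient_loss(student_hiddens, teacher_hiddens, layer_map):
    # MSE between L2-normalised [CLS] hidden states of the matched layers.
    loss = 0.0
    for s_idx, t_idx in enumerate(layer_map):
        hs = F.normalize(student_hiddens[s_idx][:, 0], dim=-1)
        ht = F.normalize(teacher_hiddens[t_idx][:, 0], dim=-1)
        loss = loss + F.mse_loss(hs, ht)
    return loss
```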

84. Weakly Supervised Cross-lingual Semantic Relation Classification via Knowledge Distillation [PDF] 返回目录
  EMNLP 2019.
  Yogarshi Vyas, Marine Carpuat
Words in different languages rarely cover the exact same semantic space. This work characterizes differences in meaning between words across languages using semantic relations that have been used to relate the meaning of English words. However, because of translation ambiguity, semantic relations are not always preserved by translation. We introduce a cross-lingual relation classifier trained only with English examples and a bilingual dictionary. Our classifier relies on a novel attention-based distillation approach to account for translation ambiguity when transferring knowledge from English to cross-lingual settings. On new English-Chinese and English-Hindi test sets, the resulting models largely outperform baselines that more naively rely on bilingual embeddings or dictionaries for cross-lingual transfer, and approach the performance of fully supervised systems on English tasks.

85. Generation-Distillation for Efficient Natural Language Understanding in Low-Data Settings [PDF] 返回目录
  EMNLP 2019. the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)
  Luke Melas-Kyriazi, George Han, Celine Liang
Over the past year, the emergence of transfer learning with large-scale language models (LM) has led to dramatic performance improvements across a broad range of natural language understanding tasks. However, the size and memory footprint of these large LMs often make them difficult to deploy in many scenarios (e.g. on mobile phones). Recent research points to knowledge distillation as a potential solution, showing that when training data for a given task is abundant, it is possible to distill a large (teacher) LM into a small task-specific (student) network with minimal loss of performance. However, when such data is scarce, there remains a significant performance gap between large pretrained LMs and smaller task-specific models, even when training via distillation. In this paper, we bridge this gap with a novel training approach, called generation-distillation, that leverages large finetuned LMs in two ways: (1) to generate new (unlabeled) training examples, and (2) to distill their knowledge into a small network using these examples. Across three low-resource text classification datasets, we achieve comparable performance to BERT while using 300 times fewer parameters, and we outperform prior approaches to distillation for text classification while using 3 times fewer parameters.

86. Natural Language Generation for Effective Knowledge Distillation [PDF] 返回目录
  EMNLP 2019. the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)
  Raphael Tang, Yao Lu, Jimmy Lin
Knowledge distillation can effectively transfer knowledge from BERT, a deep language representation model, to traditional, shallow word embedding-based neural networks, helping them approach or exceed the quality of other heavyweight language representation models. As shown in previous work, critical to this distillation procedure is the construction of an unlabeled transfer dataset, which enables effective knowledge transfer. To create transfer set examples, we propose to sample from pretrained language models fine-tuned on task-specific text. Unlike previous techniques, this directly captures the purpose of the transfer set. We hypothesize that this principled, general approach outperforms rule-based techniques. On four datasets in sentiment classification, sentence similarity, and linguistic acceptability, we show that our approach improves upon previous methods. We outperform OpenAI GPT, a deep pretrained transformer, on three of the datasets, while using a single-layer bidirectional LSTM that runs at least ten times faster.

87. Multilingual Neural Machine Translation with Knowledge Distillation [PDF] 返回目录
  ICLR 2019.
  Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, Tie-Yan Liu
Multilingual machine translation, which translates multiple languages with a single model, has attracted much attention due to its efficiency of offline training and online serving. However, traditional multilingual translation usually yields inferior accuracy compared with the counterpart using individual models for each language pair, due to language diversity and model capacity limitations. In this paper, we propose a distillation-based approach to boost the accuracy of multilingual machine translation. Specifically, individual models are first trained and regarded as teachers, and then the multilingual model is trained to fit the training data and match the outputs of individual models simultaneously through knowledge distillation. Experiments on IWSLT, WMT and Ted talk translation datasets demonstrate the effectiveness of our method. Particularly, we show that one model is enough to handle multiple languages (up to 44 languages in our experiment), with comparable or even better accuracy than individual models.
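At the token level, "fitting the training data and matching the teacher outputs" can be sketched as a weighted sum of the usual NLL and a cross-entropy against the per-language-pair teacher's distribution; the weights, temperature, and padding handling below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def multilingual_kd_loss(student_logits, teacher_logits, targets, pad_id,
                         alpha=0.5, temperature=1.0):
    # student_logits / teacher_logits: (batch, seq_len, vocab); targets: (batch, seq_len)
    nll = F.cross_entropy(student_logits.transpose(1, 2), targets, ignore_index=pad_id)
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    p_t = F.softmax(teacher_logits / temperature, dim=-1)
    mask = (targets != pad_id).unsqueeze(-1).float()
    kd = -(p_t * log_p_s * mask).sum() / mask.sum()        # teacher-matching term
    return alpha * nll + (1.0 - alpha) * kd
```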

88. A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation [PDF] 返回目录
  ICLR 2019.
  Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, Richard Socher
The convergence rate and final performance of common deep learning models have significantly benefited from recently proposed heuristics such as learning rate schedules, knowledge distillation, skip connections and normalization layers. In the absence of theoretical underpinnings, controlled experiments aimed at explaining the efficacy of these strategies can aid our understanding of deep learning landscapes and the training dynamics. Existing approaches for empirical analysis rely on tools of linear interpolation and visualizations with dimensionality reduction, each with their limitations. Instead, we revisit the empirical analysis of heuristics through the lens of recently proposed methods for loss surface and representation analysis, viz. mode connectivity and canonical correlation analysis (CCA), and hypothesize reasons why the heuristics succeed. In particular, we explore knowledge distillation and learning rate heuristics of (cosine) restarts and warmup using mode connectivity and CCA. Our empirical analysis suggests that: (a) the reasons often quoted for the success of cosine annealing are not evidenced in practice; (b) that the effect of learning rate warmup is to prevent the deeper layers from creating training instability; and (c) that the latent knowledge shared by the teacher is primarily disbursed in the deeper layers.

89. Zero-Shot Knowledge Distillation in Deep Networks [PDF] 返回目录
  ICML 2019.
  Gaurav Kumar Nayak, Konda Reddy Mopuri, Vaisakh Shaj, Venkatesh Babu Radhakrishnan, Anirban Chakraborty
Knowledge distillation deals with the problem of training a smaller model (Student) from a high capacity source model (Teacher) so as to retain most of its performance. Existing approaches use either the training data or meta-data extracted from it in order to train the Student. However, accessing the dataset on which the Teacher has been trained may not always be feasible if the dataset is very large or it poses privacy or safety concerns (e.g., bio-metric or medical data). Hence, in this paper, we propose a novel data-free method to train the Student from the Teacher. Without even using any meta-data, we synthesize the Data Impressions from the complex Teacher model and utilize these as surrogates for the original training data samples to transfer its learning to Student via knowledge distillation. We, therefore, dub our method “Zero-Shot Knowledge Distillation" and demonstrate that our framework results in competitive generalization performance as achieved by distillation using the actual training data samples on multiple benchmark datasets.
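The "data impression" idea can be sketched as optimising a random input so that the frozen teacher reproduces a sampled soft target; everything below (target sampling left to the caller, optimiser, step count) is an illustrative assumption rather than the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def synthesize_impression(teacher, target_probs, input_shape, steps=200, lr=0.05):
    # target_probs: a sampled softmax vector (num_classes,) acting as the desired teacher output.
    x = torch.randn(1, *input_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        log_p = F.log_softmax(teacher(x), dim=-1)
        loss = F.kl_div(log_p, target_probs.unsqueeze(0), reduction="batchmean")
        loss.backward()
        opt.step()
    return x.detach()      # a surrogate training sample for data-free distillation
```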

90. Towards Understanding Knowledge Distillation [PDF] 返回目录
  ICML 2019.
  Mary Phuong, Christoph Lampert
Knowledge distillation, i.e., one classifier being trained on the outputs of another classifier, is an empirically very successful technique for knowledge transfer between classifiers. It has even been observed that classifiers learn much faster and more reliably if trained with the outputs of another classifier as soft labels, instead of from ground truth data. So far, however, there is no satisfactory theoretical explanation of this phenomenon. In this work, we provide the first insights into the working mechanisms of distillation by studying the special case of linear and deep linear classifiers. Specifically, we prove a generalization bound that establishes fast convergence of the expected risk of a distillation-trained linear classifier. From the bound and its proof we extract three key factors that determine the success of distillation: (i) data geometry – geometric properties of the data distribution, in particular class separation, have a direct influence on the convergence speed of the risk; (ii) optimization bias – gradient descent optimization finds a very favorable minimum of the distillation objective; and (iii) strong monotonicity – the expected risk of the student classifier always decreases when the size of the training set grows.

91. Pedestrian Attribute Recognition by Joint Visual-semantic Reasoning and Knowledge Distillation [PDF] 返回目录
  IJCAI 2019.
  Qiaozhe Li, Xin Zhao, Ran He, Kaiqi Huang
Pedestrian attribute recognition in surveillance is a challenging task in computer vision due to significant pose variation, viewpoint change and poor image quality. To achieve effective recognition, this paper presents a graph-based global reasoning framework to jointly model potential visual-semantic relations of attributes and distill auxiliary human parsing knowledge to guide the relational learning. The reasoning framework models attribute groups on a graph and learns a projection function to adaptively assign local visual features to the nodes of the graph. After feature projection, graph convolution is utilized to perform global reasoning between the attribute groups to model their mutual dependencies. Then, the learned node features are projected back to visual space to facilitate knowledge transfer. An additional regularization term is proposed by distilling human parsing knowledge from a pre-trained teacher model to enhance feature representations. The proposed framework is verified on three large scale pedestrian attribute datasets including PETA, RAP, and PA-100k. Experiments show that our method achieves state-of-the-art results.

92. RDPD: Rich Data Helps Poor Data via Imitation [PDF] 返回目录
  IJCAI 2019.
  Shenda Hong, Cao Xiao, Trong Nghia Hoang, Tengfei Ma, Hongyan Li, Jimeng Sun
In many situations, we need to build and deploy separate models in related environments with different data qualities. For example, an environment with strong observation equipment (e.g., intensive care units) often provides high-quality multi-modal data, which are acquired from multiple sensory devices and have rich feature representations. On the other hand, an environment with poor observation equipment (e.g., at home) only provides low-quality, uni-modal data with poor feature representations. To deploy a competitive model in a poor-data environment without requiring direct access to multi-modal data acquired from a rich-data environment, this paper develops and presents a knowledge distillation (KD) method (RDPD) to enhance a predictive model trained on poor data using knowledge distilled from a high-complexity model trained on rich, private data. We evaluated RDPD on three real-world datasets and showed that its distilled model consistently outperformed all baselines across all datasets, achieving the greatest performance improvement over a model trained only on low-quality data by 24.56% on PR-AUC and 12.21% on ROC-AUC, and over a state-of-the-art KD model by 5.91% on PR-AUC and 4.44% on ROC-AUC.

93. Online Distilling from Checkpoints for Neural Machine Translation [PDF] 返回目录
  NAACL 2019.
  Hao-Ran Wei, Shujian Huang, Ran Wang, Xin-yu Dai, Jiajun Chen
Current predominant neural machine translation (NMT) models often have a deep structure with large amounts of parameters, making these models hard to train and prone to over-fitting. A common practice is to utilize a validation set to evaluate the training process and select the best checkpoint. Averaging and ensembling checkpoints can lead to further performance improvement. However, as these methods do not affect the training process, the system performance is restricted to the checkpoints generated in the original training procedure. In contrast, we propose an online knowledge distillation method. Our method generates a teacher model on the fly from checkpoints, guiding the training process to obtain better performance. Experiments on several datasets and language pairs show steady improvement over a strong self-attention-based baseline system. We also provide an analysis of the data-limited setting with respect to over-fitting. Furthermore, our method leads to an improvement in a machine reading experiment as well.
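One way to picture the on-the-fly teacher is simple bookkeeping around validation scores: the best checkpoint seen so far is frozen and reused as the teacher for subsequent updates. The sketch below is an assumed simplification (a `torch.nn.Module` model and an external distillation loss are taken as given), not the paper's exact procedure:

```python
import copy

class CheckpointTeacher:
    """Keeps a frozen copy of the best checkpoint so far to act as the teacher."""

    def __init__(self):
        self.best_score = float("-inf")
        self.teacher = None

    def maybe_update(self, model, dev_score):
        # model is assumed to be a torch.nn.Module evaluated on the validation set.
        if dev_score > self.best_score:
            self.best_score = dev_score
            self.teacher = copy.deepcopy(model).eval()
            for p in self.teacher.parameters():
                p.requires_grad_(False)
        return self.teacher   # may be None before the first validation pass
```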

94. On Knowledge distillation from complex networks for response prediction [PDF] 返回目录
  NAACL 2019.
  Siddhartha Arora, Mitesh M. Khapra, Harish G. Ramaswamy
Recent advances in Question Answering have led to the development of very complex models which compute rich representations for query and documents by capturing all pairwise interactions between query and document words. This makes these models expensive in space and time, and in practice one has to restrict the length of the documents that can be fed to these models. Such models have also been recently employed for the task of predicting dialog responses from available background documents (e.g., Holl-E dataset). However, here the documents are longer, thereby rendering these complex models infeasible except in select restricted settings. In order to overcome this, we use standard simple models which do not capture all pairwise interactions, but learn to emulate certain characteristics of a complex teacher network. Specifically, we first investigate the conicity of representations learned by a complex model and observe that it is significantly lower than that of simpler models. Based on this insight, we modify the simple architecture to mimic this characteristic. We go further by using knowledge distillation approaches, where the simple model acts as a student and learns to match the output from the complex teacher network. We experiment with the Holl-E dialog dataset and show that by mimicking characteristics and matching outputs from a teacher, even a simple network can give improved performance.

95. Network Pruning via Transformable Architecture Search [PDF] 返回目录
  NeurIPS 2019.
  Xuanyi Dong, Yi Yang
Network pruning reduces the computation costs of an over-parameterized network without performance damage. Prevailing pruning algorithms pre-define the width and depth of the pruned networks, and then transfer parameters from the unpruned network to pruned networks. To break the structure limitation of the pruned networks, we propose to apply neural architecture search to search directly for a network with flexible channel and layer sizes. The number of the channels/layers is learned by minimizing the loss of the pruned networks. The feature map of the pruned network is an aggregation of K feature map fragments (generated by K networks of different sizes), which are sampled based on the probability distribution. The loss can be back-propagated not only to the network weights, but also to the parameterized distribution to explicitly tune the size of the channels/layers. Specifically, we apply channel-wise interpolation to keep the feature map with different channel sizes aligned in the aggregation procedure. The maximum probability for the size in each distribution serves as the width and depth of the pruned network, whose parameters are learned by knowledge transfer, e.g., knowledge distillation, from the original networks. Experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate the effectiveness of our new perspective of network pruning compared to traditional network pruning algorithms. Various searching and knowledge transfer approaches are conducted to show the effectiveness of the two components. Code is at: https://github.com/D-X-Y/NAS-Projects

96. Channel Gating Neural Networks [PDF] 返回目录
  NeurIPS 2019.
  Weizhe Hua, Yuan Zhou, Christopher M. De Sa, Zhiru Zhang, G. Edward Suh
This paper introduces channel gating, a dynamic, fine-grained, and hardware-efficient pruning scheme to reduce the computation cost for convolutional neural networks (CNNs). Channel gating identifies regions in the features that contribute less to the classification result, and skips the computation on a subset of the input channels for these ineffective regions. Unlike static network pruning, channel gating optimizes CNN inference at run-time by exploiting input-specific characteristics, which allows substantially reducing the compute cost with almost no accuracy loss. We experimentally show that applying channel gating in state-of-the-art networks achieves 2.7-8.0x reduction in floating-point operations (FLOPs) and 2.0-4.4x reduction in off-chip memory accesses with a minimal accuracy loss on CIFAR-10. Combining our method with knowledge distillation reduces the compute cost of ResNet-18 by 2.6x without accuracy drop on ImageNet. We further demonstrate that channel gating can be realized in hardware efficiently. Our approach exhibits sparsity patterns that are well-suited to dense systolic arrays with minimal additional hardware. We have designed an accelerator for channel gating networks, which can be implemented using either FPGAs or ASICs. Running a quantized ResNet-18 model for ImageNet, our accelerator achieves an encouraging speedup of 2.4x on average, with a theoretical FLOP reduction of 2.8x.

97. Positive-Unlabeled Compression on the Cloud [PDF] 返回目录
  NeurIPS 2019.
  Yixing Xu, Yunhe Wang, Hanting Chen, Kai Han, Chunjing XU, Dacheng Tao, Chang Xu
Many attempts have been made to extend the great success of convolutional neural networks (CNNs) achieved on high-end GPU servers to portable devices such as smart phones. Providing compression and acceleration services for deep learning models on the cloud is therefore significant and attractive for end users. However, existing network compression and acceleration approaches usually fine-tune the svelte model by requiring the entire original training data (e.g. ImageNet), which could be more cumbersome than the network itself and cannot be easily uploaded to the cloud. In this paper, we present a novel positive-unlabeled (PU) setting for addressing this problem. In practice, only a small portion of the original training set is required as positive examples, and more useful training examples can be obtained from the massive unlabeled data on the cloud through a PU classifier with an attention-based multi-scale feature extractor. We further introduce a robust knowledge distillation (RKD) scheme to deal with the class imbalance problem of these newly augmented training examples. The superiority of the proposed method is verified through experiments conducted on benchmark models and datasets. We can use only 8% of uniformly selected data from ImageNet to obtain an efficient model with comparable performance to the baseline ResNet-34.

98. Knowledge Extraction with No Observable Data [PDF] 返回目录
  NeurIPS 2019.
  Jaemin Yoo, Minyong Cho, Taebum Kim, U Kang
Knowledge distillation is to transfer the knowledge of a large neural network into a smaller one and has been shown to be effective especially when the amount of training data is limited or the size of the student model is very small. To transfer the knowledge, it is essential to observe the data that have been used to train the network since its knowledge is concentrated on a narrow manifold rather than the whole input space. However, the data are not accessible in many cases due to the privacy or confidentiality issues in medical, industrial, and military domains. To the best of our knowledge, there has been no approach that distills the knowledge of a neural network when no data are observable. In this work, we propose KegNet (Knowledge Extraction with Generative Networks), a novel approach to extract the knowledge of a trained deep neural network and to generate artificial data points that replace the missing training data in knowledge distillation. Experiments show that KegNet outperforms all baselines for data-free knowledge distillation. We provide the source code of our paper in https://github.com/snudatalab/KegNet.

99. SCAN: A Scalable Neural Networks Framework Towards Compact and Efficient Models [PDF] 返回目录
  NeurIPS 2019.
  Linfeng Zhang, Zhanhong Tan, Jiebo Song, Jingwei Chen, Chenglong Bao, Kaisheng Ma
Remarkable achievements have been attained by deep neural networks in various applications. However, the increasing depth and width of such models also lead to explosive growth in both storage and computation, which has restricted the deployment of deep neural networks on resource-limited edge devices. To address this problem, we propose the SCAN framework for network training and inference, which is orthogonal and complementary to existing acceleration and compression methods. SCAN first divides neural networks into multiple sections according to their depth and constructs shallow classifiers upon the intermediate features of different sections. Moreover, attention modules and knowledge distillation are utilized to enhance the accuracy of the shallow classifiers. Based on this architecture, we further propose a threshold-controlled scalable inference mechanism to approach human-like sample-specific inference. Experimental results show that SCAN can be easily equipped on various neural networks without any adjustment of hyper-parameters or network architectures, yielding significant performance gains on CIFAR100 and ImageNet. Code will be released on GitHub soon.

100. When does label smoothing help? [PDF] 返回目录
  NeurIPS 2019.
  Rafael Müller, Simon Kornblith, Geoffrey E. Hinton
The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation, but does not hurt generalization or calibration of the model's predictions.
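The "weighted average of the hard targets and the uniform distribution" is a one-liner; the sketch below uses an illustrative smoothing value of 0.1:

```python
import torch.nn.functional as F

def smooth_targets(labels, num_classes, eps=0.1):
    # (1 - eps) mass on the true class, eps spread uniformly over all classes.
    one_hot = F.one_hot(labels, num_classes).float()
    return (1.0 - eps) * one_hot + eps / num_classes

def label_smoothing_cross_entropy(logits, labels, eps=0.1):
    targets = smooth_targets(labels, logits.size(-1), eps)
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```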

101. Random Path Selection for Continual Learning [PDF] 返回目录
  NeurIPS 2019.
  Jathushan Rajasegaran, Munawar Hayat, Salman H. Khan, Fahad Shahbaz Khan, Ling Shao
Incremental life-long learning is a main challenge towards the long-standing goal of Artificial General Intelligence. In real-life settings, learning tasks arrive in a sequence and machine learning models must continually learn to increment already acquired knowledge. The existing incremental learning approaches fall well below the state-of-the-art cumulative models that use all training classes at once. In this paper, we propose a random path selection algorithm, called RPS-Net, that progressively chooses optimal paths for the new tasks while encouraging parameter sharing and reuse. Our approach avoids the overhead introduced by computationally expensive evolutionary and reinforcement learning based path selection strategies while achieving considerable performance gains. As an added novelty, the proposed model integrates knowledge distillation and retrospection along with the path selection strategy to overcome catastrophic forgetting. In order to maintain an equilibrium between previous and newly acquired knowledge, we propose a simple controller to dynamically balance the model plasticity. Through extensive experiments, we demonstrate that the proposed method surpasses the state-of-the-art performance on incremental learning and by utilizing parallel computation this method can run in constant time with nearly the same efficiency as a conventional deep convolutional neural network.

102. Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems [PDF] 返回目录
  NeurIPS 2019.
  Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, Rosalind Picard
Building an open-domain conversational agent is a challenging problem. Current evaluation methods, mostly post-hoc judgments of static conversation, do not capture conversation quality in a realistic interactive context. In this paper, we investigate interactive human evaluation and provide evidence for its necessity; we then introduce a novel, model-agnostic, and dataset-agnostic method to approximate it. In particular, we propose a self-play scenario where the dialog system talks to itself and we calculate a combination of proxies such as sentiment and semantic coherence on the conversation trajectory. We show that this metric is capable of capturing the human-rated quality of a dialog model better than any automated metric known to date, achieving a significant Pearson correlation (r > 0.7, p < 0.05). To investigate the strengths of this novel metric and interactive evaluation in comparison to state-of-the-art metrics and human evaluation of static conversations, we perform extended experiments with a set of models, including several that make improvements to recent hierarchical dialog generation architectures through sentiment and semantic knowledge distillation on the utterance level. Finally, we open-source the interactive evaluation platform we built and the dataset we collected to allow researchers to efficiently deploy and evaluate dialog models.

103. Wasserstein Distance Guided Representation Learning for Domain Adaptation [PDF] 返回目录
  AAAI 2018. AAAI18 - Machine Learning Methods
  Jian Shen, Yanru Qu, Weinan Zhang, Yong Yu
Domain adaptation aims at generalizing a high-performance learner on a target domain via utilizing the knowledge distilled from a source domain which has a different but related data distribution. One solution to domain adaptation is to learn domain invariant feature representations while the learned representations should also be discriminative in prediction. To learn such representations, domain adaptation frameworks usually include a domain invariant representation learning approach to measure and reduce the domain discrepancy, as well as a discriminator for classification. Inspired by Wasserstein GAN, in this paper we propose a novel approach to learn domain invariant feature representations, namely Wasserstein Distance Guided Representation Learning (WDGRL). WDGRL utilizes a neural network, denoted by the domain critic, to estimate empirical Wasserstein distance between the source and target samples and optimizes the feature extractor network to minimize the estimated Wasserstein distance in an adversarial manner. The theoretical advantages of Wasserstein distance for domain adaptation lie in its gradient property and promising generalization bound. Empirical studies on common sentiment and image classification adaptation datasets demonstrate that our proposed WDGRL outperforms the state-of-the-art domain invariant representation learning approaches.
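Stripped of the gradient penalty the paper uses to enforce the Lipschitz constraint, the WDGRL objective reduces to a gap between the critic's mean outputs on source and target features, maximised by the critic and minimised by the feature extractor; the sketch below shows only that simplified form:

```python
def wasserstein_gap(critic, source_feat, target_feat):
    # Empirical surrogate for the Wasserstein distance between the two feature sets.
    return critic(source_feat).mean() - critic(target_feat).mean()

def critic_step_loss(critic, source_feat, target_feat):
    return -wasserstein_gap(critic, source_feat, target_feat)   # critic maximises the gap

def extractor_step_loss(critic, source_feat, target_feat):
    return wasserstein_gap(critic, source_feat, target_feat)    # extractor minimises it
```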

104. Cooperative Denoising for Distantly Supervised Relation Extraction [PDF] 返回目录
  COLING 2018.
  Kai Lei, Daoyuan Chen, Yaliang Li, Nan Du, Min Yang, Wei Fan, Ying Shen
Distantly supervised relation extraction greatly reduces human effort in extracting relational facts from unstructured texts. However, it suffers from the noisy labeling problem, which can degrade its performance. Meanwhile, the useful information expressed in knowledge graphs is still underutilized in state-of-the-art methods for distantly supervised relation extraction. In light of these challenges, we propose CORD, a novel COopeRative Denoising framework, which consists of two base networks leveraging a text corpus and a knowledge graph respectively, and a cooperative module involving their mutual learning by adaptive bi-directional knowledge distillation and dynamic ensemble with noisy-varying instances. Experimental results on a real-world dataset demonstrate that the proposed method reduces the noisy labels and achieves substantial improvement over the state-of-the-art methods.

105. One-Shot Relational Learning for Knowledge Graphs [PDF] 返回目录
  EMNLP 2018.
  Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, William Yang Wang
Knowledge graphs (KG) are the key components of various natural language processing applications. To further expand KGs’ coverage, previous studies on knowledge graph completion usually require a large number of positive examples for each relation. However, we observe long-tail relations are actually more common in KGs and those newly added relations often do not have many known triples for training. In this work, we aim at predicting new facts under a challenging setting where only one training instance is available. We propose a one-shot relational learning framework, which utilizes the knowledge distilled by embedding models and learns a matching metric by considering both the learned embeddings and one-hop graph structures. Empirically, our model yields considerable performance improvements over existing embedding models, and also eliminates the need of re-training the embedding models when dealing with newly added relations.

106. Attention-Guided Answer Distillation for Machine Reading Comprehension [PDF] 返回目录
  EMNLP 2018.
  Minghao Hu, Yuxing Peng, Furu Wei, Zhen Huang, Dongsheng Li, Nan Yang, Ming Zhou
Despite that current reading comprehension systems have achieved significant advancements, their promising performances are often obtained at the cost of making an ensemble of numerous models. Besides, existing approaches are also vulnerable to adversarial attacks. This paper tackles these problems by leveraging knowledge distillation, which aims to transfer knowledge from an ensemble model to a single model. We first demonstrate that vanilla knowledge distillation applied to answer span prediction is effective for reading comprehension systems. We then propose two novel approaches that not only penalize the prediction on confusing answers but also guide the training with alignment information distilled from the ensemble. Experiments show that our best student model has only a slight drop of 0.4% F1 on the SQuAD test set compared to the ensemble teacher, while running 12x faster during inference. It even outperforms the teacher on adversarial SQuAD datasets and NarrativeQA benchmark.

107. Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy [PDF] 返回目录
  ICLR 2018.
  Asit K. Mishra, Debbie Marr
Deep learning networks have achieved state-of-the-art accuracies on computer vision workloads like image classification and object detection. The performant systems, however, typically involve big models with numerous parameters. Once trained, a challenging aspect for such top performing models is deployment on resource constrained inference systems -- the models (often deep networks or wide networks or both) are compute and memory intensive. Low precision numerics and model compression using knowledge distillation are popular techniques to lower both the compute requirements and memory footprint of these deployed models. In this paper, we study the combination of these two techniques and show that the performance of low precision networks can be significantly improved by using knowledge distillation techniques. We call our approach Apprentice and show state-of-the-art accuracies using ternary precision and 4-bit precision for many variants of ResNet architecture on ImageNet dataset. We study three schemes in which one can apply knowledge distillation techniques to various stages of the train-and-deploy pipeline.

108. Non-Autoregressive Neural Machine Translation [PDF] 返回目录
  ICLR 2018.
  Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, Richard Socher
Existing approaches to neural machine translation condition each output word on previously generated outputs. We introduce a model that avoids this autoregressive property and produces its outputs in parallel, allowing an order of magnitude lower latency during inference. Through knowledge distillation, the use of input token fertilities as a latent variable, and policy gradient fine-tuning, we achieve this at a cost of as little as 2.0 BLEU points relative to the autoregressive Transformer network used as a teacher. We demonstrate substantial cumulative improvements associated with each of the three aspects of our training strategy, and validate our approach on IWSLT 2016 English–German and two WMT language pairs. By sampling fertilities in parallel at inference time, our non-autoregressive model achieves near-state-of-the-art performance of 29.8 BLEU on WMT 2016 English–Romanian.
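The sequence-level distillation step this model relies on amounts to re-translating the training sources with the autoregressive teacher and using those outputs as the student's targets; `translate` below is a stand-in for whatever teacher beam-search decoder is available, not a specific library API:

```python
from typing import Callable, Iterable, List, Tuple

def build_distilled_corpus(sources: Iterable[str],
                           translate: Callable[[str], str]) -> List[Tuple[str, str]]:
    # The teacher's beam-search output replaces the original reference for each source.
    return [(src, translate(src)) for src in sources]
```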

109. Born-Again Neural Networks [PDF] 返回目录
  ICML 2018.
  Tommaso Furlanello, Zachary Chase Lipton, Michael Tschannen, Laurent Itti, Anima Anandkumar
Knowledge Distillation (KD) consists of transferring “knowledge” from one machine learning model (the teacher) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student’s compactness, without sacrificing too much performance. We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these Born-Again Networks (BANs), outperform their teachers significantly, both on computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art performance on the CIFAR-10 (3.5%) and CIFAR-100 (15.5%) datasets, by validation error. Additional experiments explore two distillation objectives: (i) Confidence-Weighted by Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating the effect of the teacher outputs on both predicted and non-predicted classes.
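The born-again procedure itself is just a training loop over generations of identically parameterized models; the sketch below keeps the training routines abstract (`make_model`, `train_supervised`, and `train_with_teacher` are placeholders for your own code), so it illustrates the control flow rather than the paper's CWTM/DKPP variants:

```python
def born_again(make_model, train_supervised, train_with_teacher, n_generations=3):
    teacher = train_supervised(make_model())              # generation 0: plain training
    generations = [teacher]
    for _ in range(n_generations):
        student = make_model()                            # identical architecture to the teacher
        student = train_with_teacher(student, teacher)    # fit the data and match teacher outputs
        generations.append(student)
        teacher = student                                 # the new student teaches the next generation
    return generations                                    # use the last model or ensemble them all
```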

110. StrassenNets: Deep Learning with a Multiplication Budget [PDF] 返回目录
  ICML 2018.
  Michael Tschannen, Aran Khanna, Animashree Anandkumar
A large fraction of the arithmetic operations required to evaluate deep neural networks (DNNs) consists of matrix multiplications, in both convolution and fully connected layers. We perform end-to-end learning of low-cost approximations of matrix multiplications in DNN layers by casting matrix multiplications as 2-layer sum-product networks (SPNs) (arithmetic circuits) and learning their (ternary) edge weights from data. The SPNs disentangle multiplication and addition operations and enable us to impose a budget on the number of multiplication operations. Combining our method with knowledge distillation and applying it to image classification DNNs (trained on ImageNet) and language modeling DNNs (using LSTMs), we obtain a first-of-a-kind reduction in number of multiplications (over 99.5%) while maintaining the predictive performance of the full-precision models. Finally, we demonstrate that the proposed framework is able to rediscover Strassen’s matrix multiplication algorithm, learning to multiply $2 \times 2$ matrices using only 7 multiplications instead of 8.
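For reference, the 7-multiplication scheme that the learned ternary SPN rediscovers is Strassen's classical algorithm for 2×2 matrices; the snippet below just spells out that scheme and is not the paper's learned network:

```python
def strassen_2x2(A, B):
    # A and B are 2x2 matrices given as nested lists; only 7 multiplications are used.
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4,           m1 - m2 + m3 + m6]]
```

For example, `strassen_2x2([[1, 0], [0, 1]], [[1, 2], [3, 4]])` returns `[[1, 2], [3, 4]]`, as expected for multiplication by the identity.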

111. Progressive Blockwise Knowledge Distillation for Neural Network Acceleration [PDF] 返回目录
  IJCAI 2018.
  Hui Wang, Hanbin Zhao, Xi Li, Xu Tan
As an important and challenging problem in machine learning and computer vision, neural network acceleration essentially aims to enhance the computational efficiency without sacrificing the model accuracy too much. In this paper, we propose a progressive blockwise learning scheme for teacher-student model distillation at the subnetwork block level. The proposed scheme is able to distill the knowledge of the entire teacher network by locally extracting the knowledge of each block in terms of progressive blockwise function approximation. Furthermore, we propose a structure design criterion for the student subnetwork block, which is able to effectively preserve the original receptive field from the teacher network. Experimental results demonstrate the effectiveness of the proposed scheme against the state-of-the-art approaches.

112. KDGAN: Knowledge Distillation with Generative Adversarial Networks [PDF] 返回目录
  NeurIPS 2018.
  Xiaojie Wang, Rui Zhang, Yu Sun, Jianzhong Qi
Knowledge distillation (KD) aims to train a lightweight classifier suitable to provide accurate inference with constrained resources in multi-label learning. Instead of directly consuming feature-label pairs, the classifier is trained by a teacher, i.e., a high-capacity model whose training may be resource-hungry. The accuracy of the classifier trained this way is usually suboptimal because it is difficult to learn the true data distribution from the teacher. An alternative method is to adversarially train the classifier against a discriminator in a two-player game akin to generative adversarial networks (GAN), which can ensure that the classifier learns the true data distribution at the equilibrium of this game. However, it may take an excessively long time for such a two-player game to reach equilibrium due to high-variance gradient updates. To address these limitations, we propose a three-player game named KDGAN consisting of a classifier, a teacher, and a discriminator. The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses. By simultaneously optimizing the distillation and adversarial losses, the classifier will learn the true data distribution at the equilibrium. We approximate the discrete distribution learned by the classifier (or the teacher) with a concrete distribution. From the concrete distribution, we generate continuous samples to obtain low-variance gradient updates, which speed up the training. Extensive experiments using real datasets confirm the superiority of KDGAN in both accuracy and training speed.
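The low-variance gradient trick mentioned above comes from replacing discrete samples with a concrete (Gumbel-softmax) relaxation; the snippet below is a generic sketch of that sampling step (the temperature is illustrative, and PyTorch's built-in `torch.nn.functional.gumbel_softmax` provides the same operation):

```python
import torch
import torch.nn.functional as F

def concrete_sample(logits, temperature=0.5):
    # Add Gumbel(0, 1) noise to the logits, then take a softened argmax.
    u = torch.rand_like(logits).clamp(1e-10, 1.0 - 1e-10)
    gumbel = -torch.log(-torch.log(u))
    return F.softmax((logits + gumbel) / temperature, dim=-1)
```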

113. Collaborative Learning for Deep Neural Networks [PDF] 返回目录
  NeurIPS 2018.
  Guocong Song, Wei Chai
We introduce collaborative learning in which multiple classifier heads of the same network are simultaneously trained on the same training data to improve generalization and robustness to label noise with no extra inference cost. It acquires the strengths from auxiliary training, multi-task learning and knowledge distillation. There are two important mechanisms involved in collaborative learning. First, the consensus of multiple views from different classifier heads on the same example provides supplementary information as well as regularization to each classifier, thereby improving generalization. Second, intermediate-level representation (ILR) sharing with backpropagation rescaling aggregates the gradient flows from all heads, which not only reduces training computational complexity, but also facilitates supervision to the shared layers. The empirical results on CIFAR and ImageNet datasets demonstrate that deep neural networks learned as a group in a collaborative way significantly reduce the generalization error and increase the robustness to label noise.

114. Knowledge Distillation by On-the-Fly Native Ensemble [PDF] 返回目录
  NeurIPS 2018.
  xu lan, Xiatian Zhu, Shaogang Gong
Knowledge distillation is effective for training small and generalisable network models that meet low-memory and fast-running requirements. Existing offline distillation methods rely on a strong pre-trained teacher, which enables favourable knowledge discovery and transfer but requires a complex two-phase training procedure. Online counterparts address this limitation at the price of lacking a high-capacity teacher. In this work, we present an On-the-fly Native Ensemble (ONE) learning strategy for one-stage online distillation. Specifically, ONE only trains a single multi-branch network while simultaneously establishing a strong teacher on the fly to enhance the learning of the target network. Extensive evaluations show that ONE improves the generalisation performance of a variety of deep neural networks more significantly than alternative methods on four image classification datasets: CIFAR10, CIFAR100, SVHN, and ImageNet, whilst having computational efficiency advantages.

115. Learning to Specialize with Knowledge Distillation for Visual Question Answering [PDF] 返回目录
  NeurIPS 2018.
  Jonghwan Mun, Kimin Lee, Jinwoo Shin, Bohyung Han
Visual Question Answering (VQA) is a notoriously challenging problem because it involves various heterogeneous tasks defined by questions within a unified framework. Learning specialized models for individual types of tasks is intuitively attracting but surprisingly difficult; it is not straightforward to outperform naive independent ensemble approach. We present a principled algorithm to learn specialized models with knowledge distillation under a multiple choice learning (MCL) framework, where training examples are assigned dynamically to a subset of models for updating network parameters. The assigned and non-assigned models are learned to predict ground-truth answers and imitate their own base models before specialization, respectively. Our approach alleviates the limitation of data deficiency in existing MCL frameworks, and allows each model to learn its own specialized expertise without forgetting general knowledge. The proposed framework is model-agnostic and applicable to any tasks other than VQA, e.g., image classification with a large number of labels but few per-class examples, which is known to be difficult under existing MCL schemes. Our experimental results indeed demonstrate that our method outperforms other baselines for VQA and image classification.

116. WebChild 2.0 : Fine-Grained Commonsense Knowledge Distillation [PDF] 返回目录
  ACL 2017. System Demonstrations
  Niket Tandon, Gerard de Melo, Gerhard Weikum


117. A Joint Sequential and Relational Model for Frame-Semantic Parsing [PDF] 返回目录
  EMNLP 2017.
  Bishan Yang, Tom Mitchell
We introduce a new method for frame-semantic parsing that significantly improves the prior state of the art. Our model leverages the advantages of a deep bidirectional LSTM network which predicts semantic role labels word by word and a relational network which predicts semantic roles for individual text expressions in relation to a predicate. The two networks are integrated into a single model via knowledge distillation, and a unified graphical model is employed to jointly decode frames and semantic roles during inference. Experiments on the standard FrameNet data show that our model significantly outperforms existing neural and non-neural approaches, achieving a 5.7 F1 gain over the current state of the art, for full frame structure extraction.

118. Knowledge Distillation for Bilingual Dictionary Induction [PDF] 返回目录
  EMNLP 2017.
  Ndapandula Nakashole, Raphael Flauger
Leveraging zero-shot learning to learn mapping functions between vector spaces of different languages is a promising approach to bilingual dictionary induction. However, methods using this approach have not yet achieved high accuracy on the task. In this paper, we propose a bridging approach, where our main contribution is a knowledge distillation training objective. Rich-resource translation paths are exploited as teachers, and translation paths involving low-resource languages learn from them. Our training objective allows seamless addition of teacher translation paths for any given low-resource pair. Since our approach relies on the quality of monolingual word embeddings, we also propose to enhance vector representations of both the source and target languages with linguistic information. Our experiments on various languages show large performance gains from our distillation training objective, obtaining as high as 17% accuracy improvements.

119. Learning Efficient Object Detection Models with Knowledge Distillation [PDF] 返回目录
  NeurIPS 2017.
  Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, Manmohan Chandraker
Despite significant accuracy improvements in convolutional neural network (CNN) based object detectors, they often require prohibitive runtimes to process an image for real-time applications. State-of-the-art models often use very deep networks with a large number of floating point operations. Efforts such as model compression learn compact models with fewer parameters, but with much reduced accuracy. In this work, we propose a new framework to learn compact and fast object detection networks with improved accuracy using knowledge distillation [20] and hint learning [34]. Although knowledge distillation has demonstrated excellent improvements for simpler classification setups, the complexity of detection poses new challenges in the form of regression, region proposals and less voluminous labels. We address this through several innovations such as a weighted cross-entropy loss to address class imbalance, a teacher bounded loss to handle the regression component, and adaptation layers to better learn from intermediate teacher distributions. We conduct comprehensive empirical evaluation with different distillation configurations over multiple datasets including PASCAL, KITTI, ILSVRC and MS-COCO. Our results show consistent improvement in accuracy-speed trade-offs for modern multi-class detection models.
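Of the three innovations listed, the teacher bounded regression loss is the easiest to sketch: the student's L2 box regression loss only counts when the student is not already closer to the ground truth than the teacher (by a margin). The form below is an assumed simplification for illustration:

```python
import torch

def teacher_bounded_regression(student_box, teacher_box, target_box, margin=0.0):
    # Per-example squared errors of the student and teacher box predictions.
    ls = ((student_box - target_box) ** 2).sum(dim=-1)
    lt = ((teacher_box - target_box) ** 2).sum(dim=-1)
    # Only penalise the student where it is not yet better than the teacher by the margin.
    gated = torch.where(ls + margin > lt, ls, torch.zeros_like(ls))
    return gated.mean()
```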

120. Sequence-Level Knowledge Distillation [PDF] 返回目录
  EMNLP 2016.
  Yoon Kim, Alexander M. Rush


121. FitNets: Hints for Thin Deep Nets [PDF] 返回目录
  ICLR 2015.
  Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, Yoshua Bengio
While depth tends to improve network performances, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times less parameters outperforms a larger, state-of-the-art teacher network.
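The "additional parameters" that map the student's hidden layer to the teacher's are just a small regressor; a minimal hint-training sketch (channel sizes are illustrative) looks like this:

```python
import torch.nn as nn
import torch.nn.functional as F

class HintRegressor(nn.Module):
    """Projects the thinner student's intermediate feature map to the teacher's width."""

    def __init__(self, student_channels=32, teacher_channels=64):
        super().__init__()
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat):
        return self.proj(student_feat)

def hint_loss(regressor, student_feat, teacher_feat):
    # Hint-based pre-training stage: minimise the L2 distance between the projected
    # student feature map and the teacher's hint layer, before standard distillation.
    return F.mse_loss(regressor(student_feat), teacher_feat)
```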

Note: this paper list was compiled with the AC paper search tool.