FlyKD: Graph Knowledge Distillation on the Fly with Curriculum Learning
This work addresses efficiency and deployment challenges in machine learning by improving knowledge distillation for graph models, though it is incremental in nature.
The paper addresses two problems in knowledge distillation: the number of pseudo labels that can be generated is limited, and optimizing the student over noisy pseudo labels is difficult. It proposes FlyKD, which enables virtually unlimited pseudo-label generation and uses curriculum learning to ease that optimization. Empirically, FlyKD outperforms vanilla KD and LSPGCN.
Knowledge Distillation (KD) aims to transfer the knowledge of a more capable teacher model to a lighter student model, making the resulting model faster and easier to deploy. However, optimizing the student model over the noisy pseudo labels generated by the teacher is difficult, and the number of pseudo labels one can generate is limited by out-of-memory (OOM) errors. In this paper, we propose FlyKD (Knowledge Distillation on the Fly), which enables the generation of a virtually unlimited number of pseudo labels, coupled with Curriculum Learning, which greatly eases optimization over the noisy pseudo labels. Empirically, we observe that FlyKD outperforms vanilla KD and the renowned Local Structure Preserving Graph Convolutional Network (LSPGCN). Lastly, given the success of Curriculum Learning, we shed light on a new research direction: improving optimization over noisy pseudo labels.
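
The two ideas in the abstract can be illustrated with a minimal sketch. It is one plausible reading and not the paper's actual procedure: the abstract does not specify how pseudo labels are sampled or how the curriculum is scheduled. Everything below is hypothetical, including `teacher_score` (a stand-in for a trained teacher), the random node-pair sampling, and the linear ramp in `curriculum_weight`. The sketch shows pseudo labels being produced one batch at a time inside the training loop, so memory use stays bounded by the batch size, and a loss weight that shifts gradually from clean ground-truth labels toward noisy pseudo labels.

```python
import random


def teacher_score(u, v):
    """Hypothetical stand-in for a trained teacher: a soft label in [0, 1)."""
    return ((u * 31 + v * 17) % 100) / 100.0


def sample_pairs(num_nodes, batch_size, rng):
    """Draw a fresh batch of random node pairs; nothing is cached across steps."""
    return [(rng.randrange(num_nodes), rng.randrange(num_nodes))
            for _ in range(batch_size)]


def curriculum_weight(step, total_steps):
    """Assumed schedule: linear ramp from 0 to 1 for the pseudo-label loss weight."""
    if total_steps <= 1:
        return 1.0
    return step / (total_steps - 1)


def combined_loss(loss_ground_truth, loss_pseudo, weight):
    """Blend the clean and noisy loss terms according to the curriculum weight."""
    return (1.0 - weight) * loss_ground_truth + weight * loss_pseudo


def pseudo_label_batches(num_nodes, batch_size, total_steps, seed=0):
    """Yield (step, pairs, labels, weight), generating labels on the fly.

    Only one batch of pseudo labels exists in memory at any time, so the
    total number of labels seen grows with total_steps, not with memory.
    """
    rng = random.Random(seed)
    for step in range(total_steps):
        pairs = sample_pairs(num_nodes, batch_size, rng)
        labels = [teacher_score(u, v) for u, v in pairs]
        yield step, pairs, labels, curriculum_weight(step, total_steps)


if __name__ == "__main__":
    for step, pairs, labels, weight in pseudo_label_batches(1000, 4, 5):
        print(step, round(weight, 2), pairs[0], labels[0])
```

In a real training loop, the student's loss on each generated batch would be passed to `combined_loss` together with its loss on the ground-truth labels. Starting with a weight near 0 lets the student fit the clean labels before the noisier teacher signal dominates.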