StableKD: Breaking Inter-block Optimization Entanglement for Stable Knowledge Distillation
StableKD addresses the accuracy drops and slow training that affect model compression, and is aimed at machine learning practitioners. It is a novel method rather than an incremental improvement.
The paper tackles instability and slow convergence in knowledge distillation by identifying Inter-Block Optimization Entanglement (IBOE) and proposing StableKD, which improves accuracy by 1% to 18%, speeds up convergence by up to 10 times, and outperforms other KD approaches with only 40% of the training data.
Knowledge distillation (KD) has been recognized as an effective tool to compress and accelerate models. However, current KD approaches generally suffer from an accuracy drop and/or an excruciatingly long distillation process. In this paper, we tackle the issue by first providing a new insight into a phenomenon that we call the Inter-Block Optimization Entanglement (IBOE), which makes conventional end-to-end KD approaches unstable with noisy gradients. We then propose StableKD, a novel KD framework that breaks the IBOE and achieves more stable optimization. StableKD distinguishes itself through two operations: Decomposition and Recomposition, where the former divides a pair of teacher and student networks into several blocks for separate distillation, and the latter progressively merges them back, evolving towards end-to-end distillation. We conduct extensive experiments on the CIFAR100, Imagewoof, and ImageNet datasets with various teacher-student pairs. Compared to other KD approaches, our simple yet effective StableKD boosts model accuracy by 1% to 18%, speeds up convergence by up to 10 times, and outperforms them with only 40% of the training data.
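
The Decomposition and Recomposition operations described above can be pictured as a schedule over block indices. The following is a minimal sketch, not the authors' implementation: the function names (`decompose`, `recompose`, `schedule`) are hypothetical, and the merge rule (joining adjacent groups pairwise at each stage) is an assumption, since the abstract does not state how or when blocks are merged. How each group is actually distilled (its inputs and losses) is deliberately left out.

```python
from typing import List

Group = List[int]  # indices of the teacher/student blocks distilled together


def decompose(num_blocks: int) -> List[Group]:
    """Decomposition: every block starts as its own group, distilled separately."""
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    return [[i] for i in range(num_blocks)]


def recompose(groups: List[Group]) -> List[Group]:
    """One Recomposition step: merge adjacent groups pairwise.

    ASSUMPTION: the pairwise rule is illustrative only; the abstract does not
    specify the merge order or timing.
    """
    merged: List[Group] = []
    for i in range(0, len(groups), 2):
        # Flatten up to two neighbouring groups into a single larger group.
        merged.append([block for group in groups[i:i + 2] for block in group])
    return merged


def schedule(num_blocks: int) -> List[List[Group]]:
    """Stages from fully decomposed to one end-to-end group."""
    stages = [decompose(num_blocks)]
    while len(stages[-1]) > 1:
        stages.append(recompose(stages[-1]))
    return stages
```

Under this assumed rule, `schedule(4)` yields three stages: four single-block groups, then two two-block groups, then one group covering all four blocks, which corresponds to end-to-end distillation.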