CL | AI | LG | Apr 15, 2025

A Dual-Space Framework for General Knowledge Distillation of Large Language Models

arXiv:2504.11426v1 | 2 citations | h-index: 39 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a key bottleneck in compressing LLMs for deployment by enabling more efficient distillation across models with different architectures. The advance is incremental but practical for real-world applications.

The paper tackles two limitations of white-box knowledge distillation (KD) for large language models (LLMs): the teacher and student distributions come from mismatched output spaces, and the framework cannot be applied to models with different vocabularies. It proposes a dual-space KD framework (DSKD) that unifies the prediction heads of the two models and aligns tokens across differently tokenized sequences. Experiments show that DSKD significantly outperforms existing methods on instruction-following, mathematical reasoning, and code generation benchmarks.

Knowledge distillation (KD) is a promising solution to compress large language models (LLMs) by transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the teacher model and the student model to transfer more information. However, we reveal that the current white-box KD framework exhibits two limitations: a) bridging probability distributions from different output spaces will limit the similarity between the teacher model and the student model; b) this framework cannot be applied to LLMs with different vocabularies. One of the root causes for these limitations is that the distributions from the teacher and the student for KD are output by different prediction heads, which yield distributions in different output spaces and dimensions. Therefore, in this paper, we propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD. Specifically, we first introduce two projectors with ideal initialization to project the teacher/student hidden states into the student/teacher representation spaces. After this, the hidden states from different models can share the same head and unify the output spaces of the distributions. Furthermore, we develop an exact token alignment (ETA) algorithm to align the same tokens in two differently-tokenized sequences. Based on the above, our DSKD framework is a general KD framework that supports both off-policy and on-policy KD, and KD between any two LLMs regardless of their vocabularies. Extensive experiments on instruction-following, mathematical reasoning, and code generation benchmarks show that DSKD significantly outperforms existing methods based on the current white-box KD framework and surpasses other cross-tokenizer KD methods for LLMs with different vocabularies.
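The abstract does not spell out how exact token alignment (ETA) works, so the following is only a minimal sketch of the underlying idea, not the authors' algorithm. It assumes that two tokens count as "the same" when they cover an identical character span of the same text, and that both tokenizations concatenate back to that text. The function names `char_spans` and `align_identical_tokens` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def char_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) character span of each token in the joined text."""
    spans = []
    pos = 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def align_identical_tokens(a: List[str], b: List[str]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) where a[i] and b[j] cover the same character span.

    Assumption (not from the paper): both token lists are non-empty strings
    that concatenate to the same text.
    """
    if "".join(a) != "".join(b):
        raise ValueError("both tokenizations must spell the same text")
    spans_a, spans_b = char_spans(a), char_spans(b)
    pairs = []
    i = j = 0
    # Two-pointer sweep: advance whichever sequence is behind in the text.
    while i < len(a) and j < len(b):
        if spans_a[i] == spans_b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif spans_a[i][1] <= spans_b[j][1]:
            i += 1
        else:
            j += 1
    return pairs


# Hypothetical teacher and student tokenizations of the same string.
teacher_tokens = ["Know", "ledge", " dist", "illation"]
student_tokens = ["Knowledge", " dist", "ill", "ation"]
print(align_identical_tokens(teacher_tokens, student_tokens))  # [(2, 1)]
```

In this toy example only " dist" is tokenized identically on both sides, so a token-level distillation loss could be applied at teacher position 2 and student position 1. Matching on character spans rather than on token strings alone avoids pairing tokens that look the same but occur at different places in the text.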

Code Implementations: 1 repo