LLM-NEO: Parameter Efficient Knowledge Distillation for Large Language Models
This work addresses model compression for LLMs with a more parameter-efficient distillation method, though the contribution is incremental: it combines the existing KD and LoRA techniques rather than introducing a new one.
The paper tackles the problem of compressing large language models (LLMs) by proposing LLM-NEO, a parameter-efficient knowledge distillation method that integrates Low-Rank Adaptation (LoRA) to make knowledge transfer more efficient. It reports experimental results in which LLM-NEO outperforms the baselines when compressing Llama 2 and Llama 3.2.
Knowledge distillation (KD) has been a predominant method for compressing Large Language Models (LLMs). In this paper, we first revisit KD and Low-Rank Adaptation (LoRA) and demonstrate that they follow the same paradigm. Inspired by this observation, we propose a parameter-efficient knowledge distillation method, LLM-NEO, which integrates LoRA into KD to improve the efficiency of knowledge transfer. We then summarize some valuable guidelines for the hyperparameters in LLM-NEO. Experimental results on compressing Llama 2 and Llama 3.2 show that LLM-NEO outperforms various baselines. Further analysis demonstrates the robustness of the proposed LLM-NEO on variants of LoRA. The code and trained models are available on [GitHub](https://github.com/yang3121099/LLM-Neo).
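
To make the idea of integrating LoRA into KD concrete, here is a minimal, dependency-free sketch. It is not the authors' implementation. It assumes the common KD objective (a weighted sum of cross-entropy on the label and a temperature-softened KL divergence from teacher to student) and the standard LoRA parameterization, in which a frozen weight `W` is augmented by a trainable low-rank product `B A`. All function names, the `kd_weight` and `temperature` parameters, and the toy dimensions are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x).

    W (d_out x d_in) is the frozen student weight; only A (rank x d_in)
    and B (d_out x rank) would be updated during distillation.
    """
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * delta for b, delta in zip(base, low_rank)]


def kd_loss(student_logits, teacher_logits, label, kd_weight=0.5, temperature=2.0):
    """Assumed objective: (1 - kd_weight) * CE + kd_weight * T^2 * KL(teacher || student)."""
    ce = -math.log(softmax(student_logits)[label])
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0.0
    )
    return (1.0 - kd_weight) * ce + kd_weight * temperature ** 2 * kl


def trainable_params(d_in, d_out, rank):
    """Return (LoRA trainable parameters, full fine-tuning parameters) for one layer."""
    return rank * (d_in + d_out), d_in * d_out


# Toy example: hidden size 2, vocabulary size 3, LoRA rank 1.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # frozen student weight
A = [[0.1, -0.2]]                          # trainable, rank x d_in
B = [[0.0], [0.0], [0.0]]                  # trainable, zero-initialized as in standard LoRA
x = [1.0, 2.0]

student_logits = lora_forward(W, A, B, x, alpha=2.0, rank=1)
teacher_logits = [0.5, 2.5, 1.0]           # made-up teacher output for illustration
loss = kd_loss(student_logits, teacher_logits, label=1)
print(f"loss = {loss:.4f}")
```

In this sketch the gradient of `kd_loss` would flow only into `A` and `B`, so each adapted layer trains `rank * (d_in + d_out)` parameters instead of `d_in * d_out`. That reduction is what makes the distillation parameter-efficient.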