CL · AI · Oct 21, 2024

Pre-training Distillation for Large Language Models: A Design Space Exploration

Tsinghua
arXiv:2410.16215v1 · 17 citations · h-index: 18 · ACL
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficiently training smaller language models, a problem relevant to both researchers and practitioners. It is an incremental advance: it builds on existing knowledge distillation methods and adapts them to the pre-training phase.

The paper extends knowledge distillation for large language models to the pre-training phase. A preliminary experiment that distills a 1.9B-parameter student from a GLM-4-9B teacher validates the approach. The design-space exploration then finds that larger student models generally benefit more from pre-training distillation, while a larger teacher does not necessarily improve results.

Knowledge distillation (KD) aims to transfer knowledge from a large teacher model to a smaller student model. Previous work applying KD in the field of large language models (LLMs) typically focused on the post-training phase, where the student LLM learns directly from instructions and corresponding responses generated by the teacher model. In this paper, we extend KD to the pre-training phase of LLMs, named pre-training distillation (PD). We first conduct a preliminary experiment using GLM-4-9B as the teacher LLM to distill a 1.9B parameter student LLM, validating the effectiveness of PD. Considering the key impact factors of distillation, we systematically explore the design space of pre-training distillation across four aspects: logits processing, loss selection, scaling law, and offline or online logits. We conduct extensive experiments to explore the design space of pre-training distillation and find better configurations and interesting conclusions, such as larger student LLMs generally benefiting more from pre-training distillation, while a larger teacher LLM does not necessarily guarantee better results. We hope our exploration of the design space will inform future practices in pre-training distillation.
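
As a rough illustration of two of the design axes named in the abstract (logits processing and loss selection), the sketch below truncates teacher logits to their top-k entries and mixes a distillation loss with the ordinary language-modeling loss. The top-k truncation, the temperature, the mixing weight `alpha`, and all function names are assumptions made for illustration; they are not taken from the paper and should not be read as its reported configuration.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def truncate_top_k(teacher_logits, k, temperature=1.0):
    # Hypothetical logits-processing step: keep only the k largest teacher
    # logits per position and renormalise them into a probability distribution.
    idx = np.argsort(teacher_logits, axis=-1)[..., -k:]
    top = np.take_along_axis(teacher_logits, idx, axis=-1)
    probs = np.exp(log_softmax(top / temperature))
    return idx, probs

def distillation_loss(student_logits, top_idx, top_probs):
    # Cross-entropy of the student against the truncated teacher
    # distribution, averaged over all positions.
    logp = log_softmax(student_logits)
    picked = np.take_along_axis(logp, top_idx, axis=-1)
    return float(-(top_probs * picked).sum(axis=-1).mean())

def lm_loss(student_logits, targets):
    # Standard next-token negative log-likelihood on the ground-truth tokens.
    logp = log_softmax(student_logits)
    picked = np.take_along_axis(logp, targets[..., None], axis=-1)[..., 0]
    return float(-picked.mean())

def combined_loss(student_logits, teacher_logits, targets,
                  alpha=0.5, k=4, temperature=1.0):
    # alpha weights the distillation term; (1 - alpha) weights the LM term.
    idx, probs = truncate_top_k(teacher_logits, k, temperature)
    kd = distillation_loss(student_logits, idx, probs)
    lm = lm_loss(student_logits, targets)
    return alpha * kd + (1 - alpha) * lm

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, seq_len, vocab = 2, 3, 8           # toy sizes
    student = rng.normal(size=(batch, seq_len, vocab))
    teacher = rng.normal(size=(batch, seq_len, vocab))
    targets = rng.integers(0, vocab, size=(batch, seq_len))
    print(combined_loss(student, teacher, targets, alpha=0.5, k=4))
```

Storing only the top-k indices and probabilities, as `truncate_top_k` returns them, is one way the teacher's outputs could be precomputed and saved to disk (the "offline logits" setting the abstract mentions) without keeping a full vocabulary-sized vector per token.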

