CL · AI · May 12

A Study on Hidden Layer Distillation for Large Language Model Pre-Training

arXiv:2605.11513
AI Analysis

For researchers in LLM compression, this work provides a compute-controlled benchmark showing that HLD yields a systematic perplexity improvement over standard logit-based KD, but no consistent gains on downstream tasks.

The paper investigates hidden layer distillation (HLD) for decoder-only LLM pre-training. HLD does not consistently outperform logit-based KD on downstream tasks, but it yields a systematic perplexity gain over KD across all shared-hyperparameter configurations, which suggests that the teacher's intermediate representations carry a signal that can be extracted.

Knowledge Distillation (KD) is a critical tool for training Large Language Models (LLMs), yet the majority of research focuses on approaches that rely solely on output logits, neglecting semantic information in the teacher's intermediate representations. While Hidden Layer Distillation (HLD) showed potential for encoder architectures, its application to decoder-only pre-training at scale remains largely unexplored. Through compute-controlled experiments, we benchmark HLD against logit-based KD and self-supervised baselines with Gemma3 3.4B as teacher and 123M and 735M students trained on up to 168B tokens from the C4 dataset. Our experiments show that HLD does not consistently outperform standard KD on downstream evaluation tasks. Nevertheless, we show that HLD can yield a systematic perplexity gain over KD across all shared-hyperparameter configurations, suggesting that a latent signal can be extracted, but a breakthrough may be needed for it to play a more significant role in LLM pre-training.
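To make the distinction in the abstract concrete, here is a minimal pure-Python sketch of the two kinds of distillation signal for a single token position. It is an illustration under stated assumptions, not the paper's method: the abstract does not specify the HLD objective, so the mean-squared error on a linearly projected student hidden state used below is simply one common choice. All function names and the `alpha` weighting are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Logit-based KD: KL(teacher || student) over the vocabulary."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def project(hidden, weight):
    """Map a student hidden state to the teacher's width.

    `weight` has shape [teacher_dim][student_dim]; in training it would be
    a learned parameter (assumption: a plain linear projection).
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def hld_loss(teacher_hidden, student_hidden, weight):
    """Hidden-layer term: MSE between teacher and projected student states.

    Assumption: MSE is used here for illustration; the abstract does not
    say which hidden-state objective the paper uses.
    """
    projected = project(student_hidden, weight)
    squared = [(t - s) ** 2 for t, s in zip(teacher_hidden, projected)]
    return sum(squared) / len(squared)

def combined_loss(teacher_logits, student_logits,
                  teacher_hidden, student_hidden, weight, alpha=0.5):
    """KD on the logits plus a weighted hidden-layer term (alpha is hypothetical)."""
    return (kd_loss(teacher_logits, student_logits)
            + alpha * hld_loss(teacher_hidden, student_hidden, weight))
```

The projection is needed because a smaller student typically has a narrower hidden state than its teacher, so the two vectors cannot be compared directly. Setting `alpha=0` recovers plain logit-based KD, which is the baseline the abstract compares against.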

