LG | AI | Apr 5

Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory

arXiv:2604.04037 | 10.81 citations
AI Analysis

The paper addresses a practical problem, predicting distillation limits, but the contribution is incremental: it applies existing superposition theory to explain a saturation effect that was already known.

The paper attributes performance saturation in knowledge distillation to a geometric limit: a student network of width d_S can encode at most d_S * g(α) features, so features beyond that budget are permanently lost and produce a loss floor. The authors validate this on a toy model (48 configurations, median accuracy above 93%) and on Pythia-410M, where distillation into five student widths confirms the predicted monotonic floor ordering. The observed floor decomposes into a geometric component plus a width-independent architectural baseline (R^2 = 0.993).

Knowledge distillation compresses large teachers into smaller students, but performance saturates at a loss floor that persists across training methods and objectives. We argue this floor is geometric: neural networks represent far more features than dimensions through superposition, and a student of width $d_S$ can encode at most $d_S \cdot g(\alpha)$ features, where $g(\alpha) = 1/\left((1-\alpha)\ln\frac{1}{1-\alpha}\right)$ is a sparsity-dependent capacity function. Features beyond this budget are permanently lost, yielding an importance-weighted loss floor. We validate on a toy model (48 configurations, median accuracy >93%) and on Pythia-410M, where sparse autoencoders measure $F \approx 28{,}700$ features at $\alpha \approx 0.992$ (critical width $d_S^* \approx 1{,}065$). Distillation into five student widths confirms the predicted monotonic floor ordering. The observed floor decomposes into a geometric component and a width-independent architectural baseline ($R^2 = 0.993$). Linear probing shows coarse concepts survive even 88% feature loss, revealing the floor arises from aggregate loss of fine-grained features in the importance distribution's long tail. Our results connect representation geometry to distillation limits and provide a practical tool for predicting distillation performance from SAE measurements alone.
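
The capacity formula in the abstract can be evaluated directly. The sketch below is an illustration under stated assumptions, not the authors' code. It assumes the critical width is the width at which the budget $d_S \cdot g(\alpha)$ equals the feature count $F$, that a student keeps its most important features first, and that importances follow a synthetic $1/\text{rank}$ distribution. The function names, the width list, and the importance distribution are hypothetical. With the rounded abstract values ($F = 28{,}700$, $\alpha = 0.992$) it gives $g(\alpha) \approx 25.9$ and a critical width of about 1,109, the same order as the reported 1,065. The gap may come from rounding in the abstract's approximate figures.

```python
import math


def capacity_per_dim(alpha: float) -> float:
    """g(alpha) = 1 / ((1 - alpha) * ln(1 / (1 - alpha))), for 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return 1.0 / ((1.0 - alpha) * math.log(1.0 / (1.0 - alpha)))


def critical_width(num_features: float, alpha: float) -> float:
    """Width where d_S * g(alpha) equals the feature count (assumed definition)."""
    return num_features / capacity_per_dim(alpha)


def fraction_lost(width: int, num_features: float, alpha: float) -> float:
    """Share of features that exceed the budget d_S * g(alpha), clipped at 0."""
    budget = width * capacity_per_dim(alpha)
    return max(0.0, 1.0 - budget / num_features)


def importance_floor(width: int, importances: list[float], alpha: float) -> float:
    """Summed importance of features that do not fit in the budget.

    Assumes the student keeps features in descending order of importance.
    """
    budget = int(width * capacity_per_dim(alpha))
    ranked = sorted(importances, reverse=True)
    return sum(ranked[budget:])


# Rounded values quoted in the abstract.
F, ALPHA = 28_700, 0.992

# Hypothetical student widths and a synthetic 1/rank importance distribution;
# neither is taken from the paper.
WIDTHS = [128, 256, 512, 1024, 2048]
IMPORTANCES = [1.0 / (rank + 1) for rank in range(F)]

print(f"g(alpha)       = {capacity_per_dim(ALPHA):.2f}")
print(f"critical width = {critical_width(F, ALPHA):.0f}")
for w in WIDTHS:
    lost = fraction_lost(w, F, ALPHA)
    floor = importance_floor(w, IMPORTANCES, ALPHA)
    print(f"d_S = {w:5d}  lost = {lost:6.1%}  floor = {floor:.4f}")
```

Under these assumptions the floor shrinks monotonically as the width grows and vanishes once the width exceeds the critical width. This matches the qualitative ordering the abstract describes, though not the paper's measured values.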

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
