LG · CL · Mar 19

Anatomical Heterogeneity in Transformer Language Models

arXiv:2603.19348 · 5.0

Predicted impact: top 95% in LG · last 90 days
Originality: Highly original

AI Analysis

This work targets inefficiency in the training of transformer language models. For AI researchers and practitioners, it offers a method intended to reduce computational cost while improving performance.

The paper challenges the assumption of layer homogeneity in transformer language models through an empirical analysis of SmolLM2-135M. It reports profound anatomical heterogeneity, including a 10^7 range in layer importance and anti-layers whose removal improves performance. It proposes Growth Transformer Training, which allocates computational budget by layer importance for a reported ~54% cost reduction. A proof-of-concept experiment reports 4.7x lower validation loss than uniform training at identical parameter count, while being 13% faster.

Current transformer language models are trained with uniform computational budgets across all layers, implicitly assuming layer homogeneity. We challenge this assumption through empirical analysis of SmolLM2-135M, a 30-layer, 135M-parameter causal language model, using five diagnostic metrics: weight predictability (R^2), ablation degradation, recovery speed, weight manipulation robustness, and structural analysis. We find profound anatomical heterogeneity: (1) Layer weights follow strong mathematical regularity (R^2 = 0.91) with a universal oscillatory delta pattern (correlation ≈ -0.50), yet predicted weights cause catastrophic failure due to nonlinear error accumulation. (2) Layer importance spans a 10^7 range, from a critical core (L8-11, up to +63,419% PPL degradation) to anti-layers (L14, L17) whose removal improves performance. (3) Recovery speed correlates with layer importance, indicating differential training requirements. (4) Only weight scaling (alpha = 0.9) preserves model quality among five tested manipulation strategies. (5) Growth Transformer Training, allocating budget by layer importance, achieves ~54% cost reduction. A proof-of-concept experiment confirms this: 4.7x lower validation loss than uniform training at identical parameter count, while being 13% faster.
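
To make two ideas from the abstract concrete, the following is a minimal sketch, not the paper's implementation. It shows how a per-layer ablation score can be expressed as a percentage change in perplexity, and one possible way to split a training budget by layer importance. The function names, the proportional allocation rule, the clipping of negative scores (anti-layers) to zero, and the `min_steps` floor are all assumptions made for illustration; the abstract does not specify how Growth Transformer Training allocates its budget.

```python
def ppl_degradation_pct(base_ppl, ablated_ppl):
    """Percent change in perplexity when a layer is removed.

    Positive values mean the model got worse without the layer;
    negative values correspond to what the abstract calls anti-layers.
    """
    return 100.0 * (ablated_ppl - base_ppl) / base_ppl


def allocate_budget(importance, total_steps, min_steps=0):
    """Split total_steps across layers in proportion to importance.

    Assumed rule (hypothetical, not taken from the paper):
    - negative importance is clipped to zero,
    - every layer receives at least min_steps,
    - the remainder is shared proportionally, with leftover steps from
      rounding given to the layers with the largest fractional parts.
    """
    n = len(importance)
    remaining = total_steps - min_steps * n
    if remaining < 0:
        raise ValueError("total_steps is too small for the requested min_steps")

    clipped = [max(score, 0.0) for score in importance]
    total = sum(clipped)
    if total == 0:
        # No layer has positive importance: fall back to a uniform split.
        clipped = [1.0] * n
        total = float(n)

    raw = [remaining * c / total for c in clipped]
    steps = [min_steps + int(r) for r in raw]

    # Hand out steps lost to truncation, largest fractional part first.
    leftover = total_steps - sum(steps)
    order = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in order[:leftover]:
        steps[i] += 1
    return steps


if __name__ == "__main__":
    # Toy importance scores for four layers; the third is an anti-layer.
    scores = [3.0, 1.0, -0.5, 0.0]
    print(allocate_budget(scores, total_steps=100, min_steps=5))
```

With the toy scores above, the two layers with positive importance share the budget left after the floor in a 3:1 ratio, while the anti-layer and the zero-importance layer receive only the floor. Real scores would come from measured ablation results such as those the abstract reports.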
