Crown, Frame, Reverse: Layer-Wise Scaling Variants for LLM Pre-Training
This work addresses architectural efficiency in LLM pre-training and is aimed at NLP researchers, though it is incremental, building on existing Layer-Wise Scaling and pruning literature.
The authors tackled the problem of uniform layer sizes in transformer-based language models by introducing three new Layer-Wise Scaling variants (Framed, Reverse, Crown) that redistribute FFN widths and attention heads via linear interpolation during pre-training. On a fixed budget of 180M parameters trained on 5B tokens, all variants converged to similar losses and outperformed an equal-cost isotropic baseline, with no substantial decrease in training throughput.
Transformer-based language models traditionally use uniform (isotropic) layer sizes, ignoring the diverse functional roles that different depths can play and their differing computational capacity needs. Building on Layer-Wise Scaling (LWS) and pruning literature, we introduce three new LWS variants (Framed, Reverse, and Crown) that redistribute FFN widths and attention heads via two- or three-point linear interpolation in the pre-training stage. We present the first systematic ablation of LWS and its variants on a fixed budget of 180M parameters, trained on 5B tokens. All models converge to similar losses and achieve better performance than an equal-cost isotropic baseline, without a substantial decrease in training throughput. This work represents an initial step into the design space of layer-wise architectures for pre-training; future work should scale experiments to orders of magnitude more tokens and parameters to fully assess their potential.
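To make the idea of two- or three-point linear interpolation concrete, the following is a minimal sketch of how a per-layer FFN width schedule could be derived from a few anchor values. It is an illustration under stated assumptions, not the authors' implementation: the function names `interpolate_schedule` and `layer_widths`, the anchor values, the even spacing of anchors along depth, and the rounding to a multiple of 64 are all hypothetical, and no particular anchor shape is claimed to correspond to Framed, Reverse, or Crown.

```python
from typing import List, Sequence


def interpolate_schedule(anchors: Sequence[float], num_layers: int) -> List[float]:
    """Piecewise-linear interpolation of a per-layer value across depth.

    Assumption: anchors are evenly spaced along depth. Two anchors give a
    straight line from the first to the last layer; three anchors add a
    midpoint, so the schedule can rise and fall (or fall and rise).
    """
    if num_layers < 2 or len(anchors) < 2:
        raise ValueError("need at least two layers and two anchors")
    segments = len(anchors) - 1
    values = []
    for layer in range(num_layers):
        # Position of this layer in anchor coordinates, in [0, segments].
        pos = layer / (num_layers - 1) * segments
        left = min(int(pos), segments - 1)  # index of the anchor to the left
        frac = pos - left                   # distance past that anchor
        values.append(anchors[left] * (1 - frac) + anchors[left + 1] * frac)
    return values


def layer_widths(anchors: Sequence[float], num_layers: int,
                 d_model: int, multiple_of: int = 64) -> List[int]:
    """Convert interpolated FFN multipliers into hidden sizes.

    Assumption: widths are rounded to a hardware-friendly multiple.
    """
    return [
        max(multiple_of, round(m * d_model / multiple_of) * multiple_of)
        for m in interpolate_schedule(anchors, num_layers)
    ]


if __name__ == "__main__":
    # Illustrative anchor values only; they are not taken from the paper.
    print(layer_widths([1.0, 3.0], num_layers=5, d_model=512))
    print(layer_widths([1.0, 3.0, 1.0], num_layers=5, d_model=512))
```

The same schedule function could be applied to attention head counts by rounding to integers instead of to a width multiple. In a fixed-budget comparison such as the one described above, the anchor values would additionally need to be chosen so that the total parameter count matches the isotropic baseline; that constraint is not enforced in this sketch.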