CL, AI · Sep 8, 2025

Crown, Frame, Reverse: Layer-Wise Scaling Variants for LLM Pre-Training

arXiv:2509.06518v1
Originality: Incremental advance
AI Analysis

This work addresses architectural efficiency in LLM pre-training and is aimed at NLP researchers. It is an incremental advance that builds on the existing Layer-Wise Scaling and pruning literature.

The authors address the limitations of uniform layer sizes in transformer-based language models by introducing three new Layer-Wise Scaling variants (Framed, Reverse, and Crown) that redistribute FFN widths and attention heads across layers via linear interpolation during pre-training. On a fixed budget of 180M parameters trained on 5B tokens, all variants converged to similar losses and outperformed an equal-cost isotropic baseline, with no substantial decrease in training throughput.

Transformer-based language models traditionally use uniform (isotropic) layer sizes, a design that ignores the diverse functional roles that different depths can play and their differing computational capacity needs. Building on Layer-Wise Scaling (LWS) and pruning literature, we introduce three new LWS variants, Framed, Reverse, and Crown, that redistribute FFN widths and attention heads via two- or three-point linear interpolation in the pre-training stage. We present the first systematic ablation of LWS and its variants on a fixed budget of 180M parameters, trained on 5B tokens. All models converge to similar losses and achieve better performance than an equal-cost isotropic baseline, without a substantial decrease in training throughput. This work represents an initial step into the design space of layer-wise architectures for pre-training; future work should scale experiments to orders of magnitude more tokens and parameters to fully assess their potential.
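To make "two- or three-point linear interpolation" concrete, here is a minimal sketch of how per-layer FFN widths could be derived from a few anchor multipliers. The function names, the anchor values, the rounding to a multiple of 64, and the even spacing of anchors across depth are all illustrative assumptions, not details from the paper. The example shapes are not mapped to the named variants, and the sketch does not enforce the fixed parameter budget.

```python
def interpolate_profile(anchors, num_layers):
    """Piecewise-linear multipliers across depth (hypothetical helper).

    anchors: 2 or 3 multipliers, assumed evenly spaced from the first
    layer to the last (an assumption, not the paper's specification).
    """
    if len(anchors) not in (2, 3):
        raise ValueError("expected two or three anchor points")
    if num_layers < 2:
        raise ValueError("need at least two layers to interpolate")
    segments = len(anchors) - 1
    profile = []
    for layer in range(num_layers):
        # Normalized depth in [0, 1], stretched over the anchor segments.
        pos = layer / (num_layers - 1) * segments
        seg = min(int(pos), segments - 1)  # clamp so the last layer stays in range
        frac = pos - seg
        profile.append(anchors[seg] + frac * (anchors[seg + 1] - anchors[seg]))
    return profile


def ffn_widths(anchors, num_layers, d_model, multiple_of=64):
    """Turn multipliers into FFN hidden sizes, rounded to a hardware-friendly multiple."""
    return [
        max(multiple_of, round(m * d_model / multiple_of) * multiple_of)
        for m in interpolate_profile(anchors, num_layers)
    ]


# Two-point profile: widths grow linearly with depth.
two_point = ffn_widths([1.0, 3.0], num_layers=5, d_model=512)
# Three-point profile: wide at both ends, narrow in the middle.
three_point = ffn_widths([4.0, 1.0, 4.0], num_layers=5, d_model=512)
print(two_point)
print(three_point)
```

An isotropic baseline corresponds to passing identical anchors, such as `[2.0, 2.0]`. For an equal-cost comparison like the one described above, the anchors would additionally need to be chosen so that the total parameter count matches the baseline. The same interpolation could be applied to attention head counts, rounding to whole heads.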
