AIApr 5

Preservation Is Not Enough for Width Growth: Regime-Sensitive Selection of Dense LM Warm Starts

arXiv:2604.0428111.51 citations
AI Analysis

This work addresses a specific, incremental challenge in efficiently reusing smaller model checkpoints for width growth in language models.

The study tackled the problem of selecting warm starts for width expansion in language models, finding that zero-step preservation alone is insufficient and that the best selection depends on the training regime and budget, with exact-copy warm starts performing best in stochastic continuations and structured non-clone ones winning in deterministic settings.

Width expansion offers a practical route to reuse smaller causal-language-model checkpoints, but selecting a widened warm start is not solved by zero-step preservation alone. We study dense width growth as a candidate-selection problem over full training states, including copied weights, optimizer moments, and scheduler state. In a small-scale TinyStories proxy, we compare exact-copy, perturbative, asymmetric-reset, and structured non-clone warm starts under matched continuation budgets. We evaluate zero-step preservation, short-lag probe metrics, and downstream continuation utility in deterministic and stochastic regimes. The picture is mixed and partially replicated through a reduced-pool seed-1 check. Exact-copy symmetric warm starts rank first in every completed 16-step probe and in the completed stochastic 128-step continuations at seed-0 steps 1000 and 2000 plus reduced seed-1 step 2000. By contrast, the structured non-clone challenger wins deterministic 128-step continuation. Early escape from the inherited cloned subspace is therefore not a universal selector: it helps in long deterministic continuation, but it misleads at short lag and under stochastic continuation. The result is narrow but useful: for dense width growth at this scale, preservation is not a universal ranking criterion, and the best replacement signal depends on both regime and lag budget.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes