LG · CL · Jun 4, 2024

Landscape-Aware Growing: The Power of a Little LAG

arXiv:2406.02469v1 · 4 citations
AI Analysis

This work addresses efficient model scaling: given several ways to grow a small pretrained Transformer into a larger one, it offers researchers and practitioners an incremental improvement in how to choose among them.

The paper tackles the problem of selecting the best growing strategy for efficient pretraining of Transformer models. It finds that early training dynamics predict final performance more accurately than behavior at initialization, so the best strategy can be identified after only a short delay.

Recently, there has been increasing interest in efficient pretraining paradigms for training Transformer-based models. Several recent approaches use smaller models to initialize larger models in order to save computation (e.g., stacking and fusion). In this work, we study the fundamental question of how to select the best growing strategy from a given pool of growing strategies. Prior works have extensively focused on loss- and/or function-preserving behavior at initialization or simply performance at the end of training. Instead, we identify that behavior at initialization can be misleading as a predictor of final performance and present an alternative perspective based on early training dynamics, which we call "landscape-aware growing (LAG)". We perform extensive analysis of correlation of the final performance with performance in the initial steps of training and find early and more accurate predictions of the optimal growing strategy (i.e., with only a small "lag" after initialization). This perspective also motivates an adaptive strategy for gradual stacking.

