LGAIMay 2, 2025

Don't be lazy: CompleteP enables compute-efficient deep transformers

arXiv:2505.01618v353 citationsh-index: 28Has Code
Originality Incremental advance
AI Analysis

This work addresses a critical bottleneck for practitioners scaling up deep transformers by enabling more efficient training and better hardware utilization, though it is incremental as it builds on existing parameterization methods.

The paper tackles the problem of compute-efficient training of large language models by identifying a parameterization called CompleteP that achieves both hyperparameter transfer across model depth and non-lazy learning in all layers, resulting in 12-34% compute efficiency improvements over prior state-of-the-art.

We study compute efficiency of LLM training when using different parameterizations, i.e., rules for adjusting model and optimizer hyperparameters (HPs) as model size changes. Some parameterizations fail to transfer optimal base HPs (such as learning rate) across changes in model depth, requiring practitioners to either re-tune these HPs as they scale up (expensive), or accept sub-optimal training when re-tuning is prohibitive. Even when they achieve HP transfer, we develop theory to show parameterizations may still exist in the lazy learning regime where layers learn only features close to their linearization, preventing effective use of depth and nonlinearity. Finally, we identify and adopt the parameterization we call CompleteP that achieves both depth-wise HP transfer and non-lazy learning in all layers. CompleteP enables a wider range of model width/depth ratios to remain compute-efficient, unlocking shapes better suited for different hardware settings and operational contexts. Moreover, CompleteP enables 12-34% compute efficiency improvements over the prior state-of-the-art. All experiments were run on Cerebras CS-3 systems. A minimal implementation is available at https://github.com/EleutherAI/nanoGPT-mup/tree/completep.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes