LG · Apr 8, 2024

An Empirical Study of $μ$P Learning Rate Transfer

arXiv:2404.05728
Originality: Synthesis-oriented
AI Analysis

This paper addresses the cost of hyperparameter tuning for deep learning practitioners by validating a method that reduces it. The contribution is incremental: it empirically verifies an existing theoretical approach rather than proposing a new one.

This paper empirically investigates whether the $μ$-Parameterization ($μ$P) method enables zero-shot transfer of near-optimal learning rates from small to large transformer models, finding it works in most settings across experiments with up to 10B parameters and 190B tokens.

Deep learning models have become a cornerstone of modern AI research, yet their initializations and learning rates may at times be set in an opaque or ad-hoc fashion due to the high cost of hyperparameter sweeps. The $μ$-Parameterization ($μ$P) offers a possible solution to this challenge, yielding scaling rules for model initialization and learning rates while reportedly enabling zero-shot hyperparameter transfer from small to large models. Despite its evident promise, the $μ$P method is not yet widely adopted, perhaps due to higher implementation complexity, many variations, or complex theoretical background. This work considers $μ$P empirically, focusing on the popular transformer architecture, and aims to answer a simple question: does $μ$-Transfer yield near-optimal learning rates in practice? Studying over a dozen ablations with up to 1.2B parameters and 33B tokens and a large-scale experiment with up to 10B parameters and 190B tokens, we observe a positive answer for most settings, and discuss improvements otherwise.
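To make the phrase "scaling rules for model initialization and learning rates" concrete, the sketch below shows one commonly cited form of the $μ$P rules for Adam-type optimizers. It is an illustration under stated assumptions, not necessarily the exact rule set used in the paper: the abstract does not spell the rules out. The function names, the parameter-group names (`embedding`, `hidden`, `unembedding`), and the example widths are all hypothetical.

```python
# Minimal sketch of muP-style scaling for Adam-type optimizers.
# ASSUMPTION: the rule set below (hidden/unembedding learning rates scale with
# base_width / width; hidden init std scales with width**-0.5; unembedding
# init std scales with 1 / width) is one common formulation. The paper's
# exact rules are not given in the abstract.

def mup_learning_rates(base_lr: float, base_width: int, width: int) -> dict:
    """Per-group learning rates transferred from a small proxy model.

    base_lr    -- learning rate tuned on the proxy model
    base_width -- hidden width of the proxy model
    width      -- hidden width of the larger target model
    """
    ratio = base_width / width
    return {
        "embedding": base_lr,            # input embeddings: not rescaled
        "hidden": base_lr * ratio,       # hidden weight matrices: shrink with width
        "unembedding": base_lr * ratio,  # output projection: shrink with width
    }


def mup_init_std(width: int) -> dict:
    """Per-group initialization standard deviations at a given width."""
    return {
        "embedding": 1.0,            # width-independent
        "hidden": width ** -0.5,     # variance proportional to 1 / width
        "unembedding": 1.0 / width,  # variance proportional to 1 / width**2
    }


if __name__ == "__main__":
    # Hypothetical example: tune at width 128, transfer to width 1024.
    print(mup_learning_rates(base_lr=1e-2, base_width=128, width=1024))
    print(mup_init_std(1024))
```

Under these assumptions, the learning rate is swept only on the narrow proxy model; the wider model reuses `base_lr` through the ratio `base_width / width`, with no further sweep. Whether that transferred value is in fact near-optimal is the question the paper tests.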
