u-$μ$P: The Unit-Scaled Maximal Update Parametrization
This work addresses hyperparameter tuning efficiency for machine learning practitioners, offering an incremental improvement over the existing $μ$P method.
The paper tackles the problem of making optimal hyperparameters independent of model size. It introduces u-$μ$P, which combines the Maximal Update Parametrization with Unit Scaling to simplify training and make hyperparameter sweeps more efficient. u-$μ$P models reach a loss equal to or lower than that of comparable $μ$P models and work out-of-the-box in FP8.
The Maximal Update Parametrization ($μ$P) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-$μ$P, which improves upon $μ$P by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: $μ$P ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-$μ$P models reaching a loss that is equal to or lower than comparable $μ$P models and working out-of-the-box in FP8.
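As a rough illustration of the "scale of one" idea, the sketch below shows how a single linear layer can keep its output at roughly unit standard deviation regardless of width. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `unit_scaled_linear` and `output_std` are hypothetical, weights and inputs are assumed to be independent standard-normal values, and the fixed `1/sqrt(fan_in)` multiplier is the textbook variance argument for a sum of `fan_in` independent unit-variance products. It covers only the forward activation scale at initialization, not gradients or the $μ$P learning-rate rules.

```python
import math
import random
import statistics


def unit_scaled_linear(x, w):
    """Return (x @ w) / sqrt(fan_in) for one input vector x.

    w is a list of fan_in rows, each holding fan_out weights.
    The 1/sqrt(fan_in) factor is an illustrative assumption that keeps
    the output variance near one when x and w have unit variance.
    """
    fan_in = len(x)
    fan_out = len(w[0])
    scale = 1.0 / math.sqrt(fan_in)
    return [
        scale * sum(x[i] * w[i][j] for i in range(fan_in))
        for j in range(fan_out)
    ]


def output_std(fan_in, fan_out=64, n_samples=32, seed=0):
    """Estimate the output standard deviation for a given layer width."""
    rng = random.Random(seed)
    # Unit-variance weights: no width-dependent initializer is needed,
    # because the scaling lives in the forward pass instead.
    w = [[rng.gauss(0.0, 1.0) for _ in range(fan_out)] for _ in range(fan_in)]
    outputs = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
        outputs.extend(unit_scaled_linear(x, w))
    return statistics.pstdev(outputs)


if __name__ == "__main__":
    # The estimate should stay close to one as the width grows.
    for width in (16, 64, 256):
        print(width, round(output_std(width), 2))
```

Without the `1/sqrt(fan_in)` factor, the output standard deviation would grow like `sqrt(fan_in)`. That width dependence is what both $μ$P and Unit Scaling aim to remove, and values near one sit comfortably inside the range of low-precision formats such as FP8.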