LG, Aug 13, 2025

$μ$-Parametrization for Mixture of Experts

Jan Małaśnicki, Kamil Ciebiera, Mateusz Boruń, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jakub Krajewski

arXiv:2508.09752v2 | 3 citations | h-index: 6 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses efficient hyperparameter tuning for large-scale MoE models, which is crucial for reducing the cost of training extremely large models. It is an incremental advance, since it extends existing μTransfer techniques from dense to MoE architectures.

The paper addresses the prohibitive cost of hyperparameter tuning for large-scale Mixture-of-Experts (MoE) models. It derives a μ-Parameterization for MoE with theoretical guarantees for feature learning across model widths, and shows experimentally that the optimal learning rate transfers reliably across model sizes.

Recent years have seen growing interest in and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture for extremely large models. Currently, the largest open-source models exceed $1$T parameters. At such scales, hyperparameter tuning becomes prohibitively expensive. For precisely this reason, $μ$Transfer is becoming a key technique. It allows optimal hyperparameters to be transferred across model scales, greatly reducing tuning costs. However, existing work has primarily focused on dense LLMs, leaving MoE architectures unexplored. In this work, we derive a $μ$-Parameterization for MoE, providing theoretical guarantees for feature learning across model widths. Our experiments demonstrate that the optimal learning rate reliably transfers across model sizes, establishing a foundation for efficient hyperparameter tuning in large-scale MoE models.
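To make the idea of transferring a learning rate across widths concrete, here is a minimal sketch of the rule commonly used for dense models trained with Adam, where the learning rate of a hidden weight matrix is scaled inversely with width relative to a small tuned base model. This is not the MoE-specific parameterization derived in the paper, whose details are not given in the abstract. The function name `mup_hidden_lr` and all numeric values are hypothetical and chosen only for illustration.

```python
def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Rescale an Adam learning rate tuned at base_width for a hidden matrix at width.

    Assumption: the dense-model rule (hidden-matrix LR proportional to 1/width).
    This is NOT the MoE-specific scaling derived in the paper.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / width


# Hypothetical values: a learning rate tuned on a small proxy model of width 256.
base_lr = 1e-2
base_width = 256

for width in (256, 1024, 4096):
    # Wider hidden layers receive a proportionally smaller learning rate.
    print(width, mup_hidden_lr(base_lr, base_width, width))
```

Under this assumed rule, the expensive sweep is run once on the narrow proxy, and only the rescaled value is used at the target width.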
