LG · Aug 13, 2025

$μ$-Parametrization for Mixture of Experts

arXiv:2508.09752v2 · 3 citations · h-index: 6 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses efficient hyperparameter tuning for large-scale MoE models, which matters for reducing the cost of training extremely large models. The advance is incremental: it extends existing μTransfer techniques from dense to MoE architectures.

Hyperparameter tuning for large-scale Mixture-of-Experts (MoE) models is prohibitively expensive. The paper derives a μ-Parameterization for MoE that provides theoretical guarantees for feature learning across model widths, and shows experimentally that the optimal learning rate transfers reliably across model sizes.

Recent years have seen growing interest in and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture in extremely large models. Currently, the largest open-source models exceed $1$T parameters. At such scales, hyperparameter tuning becomes prohibitively expensive. Precisely for this reason, $μ$Transfer is becoming a key technique. It allows optimal hyperparameters to be transferred across model scales, greatly reducing tuning costs. However, existing work has primarily focused on dense LLMs, leaving MoE architectures unexplored. In this work, we derive a $μ$-Parameterization for MoE, providing theoretical guarantees for feature learning across model widths. Our experiments demonstrate that the optimal learning rate reliably transfers across model sizes, establishing a foundation for efficient hyperparameter tuning in large-scale MoE models.
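
To make the idea of learning-rate transfer concrete, the sketch below shows the width scaling commonly associated with μTransfer for dense models trained with Adam-style optimizers: a learning rate tuned at a small base width is divided by the width ratio for hidden (matrix-like) weights and left unchanged for vector-like parameters. This is an illustration under those assumptions, not the MoE parameterization derived in the paper. The function name `mup_adam_lr` and the `kind` labels are hypothetical, and the output layer, router, and expert weights are deliberately left out because their treatment is not described here.

```python
def mup_adam_lr(base_lr: float, base_width: int, width: int, kind: str) -> float:
    """Rescale a learning rate tuned at `base_width` for a model of `width`.

    Assumption (dense-model rule for Adam-style optimizers, not the paper's
    MoE derivation): hidden matrix-like weights use base_lr * base_width / width,
    while vector-like parameters (biases, norm gains) keep base_lr.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    if kind == "hidden":
        return base_lr * base_width / width
    if kind == "vector":
        return base_lr
    raise ValueError(f"unknown parameter kind: {kind!r}")


if __name__ == "__main__":
    # Hypothetical values: a learning rate tuned on a width-256 proxy model.
    tuned_lr, proxy_width = 1e-2, 256
    for target_width in (256, 1024, 4096):
        hidden = mup_adam_lr(tuned_lr, proxy_width, target_width, "hidden")
        vector = mup_adam_lr(tuned_lr, proxy_width, target_width, "vector")
        print(f"width={target_width}: hidden lr={hidden:.6f}, vector lr={vector:.6f}")
```

In practice the rescaled values would typically be passed to the optimizer as separate parameter groups, so that the sweep is run once on the small proxy model and reused at larger widths.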

