Dion: Distributed Orthonormalized Updates
This work is an incremental improvement to optimizers for large-scale language model training, targeting the compute and communication bottlenecks of orthonormalized updates.
The paper tackles the inefficiency of orthonormalized updates in large-scale LLM training by introducing Dion, a distributed method that reduces wall-clock time while retaining benefits like improved stability and hyperparameter transfer.
Orthonormalized updates accelerate training, improve stability, and enable robust hyperparameter transfer, but existing methods like Muon rely on dense matrix operations that clash with sharded weights in large-scale LLM training, causing high compute and communication cost. We introduce Dion (Distributed Orthonormalization), a scalable and efficient update rule that replaces Newton-Schulz iteration with amortized power iteration on a momentum buffer, avoiding full-matrix reconstruction and integrating cleanly with weight sharding. The rank-fraction parameter with error feedback enables low-rank updates that balance quality with significant cost savings. On language models from 160M to 3B parameters, Dion retains the benefits of orthonormalized updates, while markedly reducing wall-clock time at scale, making it a practical optimizer for next-generation foundation models. Code is available at: https://github.com/microsoft/dion/