LG AI OC Apr 7, 2025

Dion: Distributed Orthonormalized Updates

Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, John Langford

arXiv:2504.05295v336.449 citations h-index: 12 Has Code

Originality: Incremental advance

AI Analysis

Dion is an incremental improvement for large-scale language model training that targets the compute and communication bottlenecks of orthonormalized updates.

The paper tackles the inefficiency of orthonormalized updates in large-scale LLM training by introducing Dion, a distributed method that reduces wall-clock time while retaining benefits like improved stability and hyperparameter transfer.

Orthonormalized updates accelerate training, improve stability, and enable robust hyperparameter transfer, but existing methods like Muon rely on dense matrix operations that clash with sharded weights in large-scale LLM training, causing high compute and communication cost. We introduce Dion (Distributed Orthonormalization), a scalable and efficient update rule that replaces Newton-Schulz iteration with amortized power iteration on a momentum buffer, avoiding full-matrix reconstruction and integrating cleanly with weight sharding. The rank-fraction parameter with error feedback enables low-rank updates that balance quality with significant cost savings. On language models from 160M to 3B parameters, Dion retains the benefits of orthonormalized updates, while markedly reducing wall-clock time at scale, making it a practical optimizer for next-generation foundation models. Code is available at: https://github.com/microsoft/dion/
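The idea of one warm-started power-iteration step on a momentum buffer, with error feedback, might be sketched as follows for a single unsharded weight matrix. This is only an illustration under stated assumptions, not the code from the linked repository: the function name `lowrank_orthonormal_step`, the decay factor `mu`, the column normalization, and the QR-based orthonormalization are hypothetical choices, and sharding, communication, and any shape-dependent learning-rate scaling are left out.

```python
import numpy as np

def lowrank_orthonormal_step(X, G, M, Q, lr=0.01, mu=0.95):
    """One hypothetical low-rank orthonormalized update with error feedback.

    X: weights (m x n), G: gradient (m x n),
    M: momentum buffer (m x n), Q: right basis from the previous step (n x r).
    """
    B = M + G                              # fold the new gradient into the buffer
    P, _ = np.linalg.qr(B @ Q)             # one power-iteration step, orthonormal columns (m x r)
    R = B.T @ P                            # matching right factor (n x r)
    # Error feedback (assumed form): shrink only the part of the buffer
    # captured by the rank-r approximation P @ R.T; the rest stays in M.
    M_new = B - (1.0 - mu) * (P @ R.T)
    # Normalize columns so the applied update P @ Q_new.T is roughly orthonormal.
    Q_new = R / (np.linalg.norm(R, axis=0, keepdims=True) + 1e-8)
    X_new = X - lr * (P @ Q_new.T)
    return X_new, M_new, Q_new

# Usage: a 64 x 32 layer with an assumed rank fraction of 0.25, so r = 8.
rng = np.random.default_rng(0)
m, n = 64, 32
r = int(0.25 * n)
X = rng.standard_normal((m, n))
M = np.zeros((m, n))
Q = rng.standard_normal((n, r))            # random initial right basis
for _ in range(3):
    G = rng.standard_normal((m, n))        # stand-in for a real gradient
    X, M, Q = lowrank_orthonormal_step(X, G, M, Q)
```

Because `Q` is carried over between steps, each call performs only one matrix product per factor rather than a full iterative orthonormalization, which is the sense in which the power iteration is amortized in this sketch.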
