Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization
This work advances the theoretical understanding of optimization methods used in deep learning. It offers explanations for practical observations such as the relative performance of optimizers and the effect of weight decay, though the contribution is incremental in that it builds on prior work.
The paper provides a theoretical analysis of gradient orthogonalization in deep learning. It shows that the orthogonalized gradient method is a first-order trust-region method, and it develops a stochastic non-Euclidean trust-region gradient method with momentum that recovers existing optimizers as special cases. It also proves state-of-the-art convergence results for this method in a range of scenarios.
Optimization with matrix gradient orthogonalization has recently demonstrated impressive results in the training of deep neural networks (Jordan et al., 2024; Liu et al., 2025). In this paper, we provide a theoretical analysis of this approach. In particular, we show that the orthogonalized gradient method can be seen as a first-order trust-region optimization method, where the trust-region is defined in terms of the matrix spectral norm. Motivated by this observation, we develop the stochastic non-Euclidean trust-region gradient method with momentum, which recovers the Muon optimizer (Jordan et al., 2024) as a special case, along with normalized SGD and signSGD with momentum (Cutkosky and Mehta, 2020; Sun et al., 2023). In addition, we prove state-of-the-art convergence results for the proposed algorithm in a range of scenarios, which involve arbitrary non-Euclidean norms, constrained and composite problems, and non-convex, star-convex, first- and second-order smooth functions. Finally, our theoretical findings provide an explanation for several practical observations, including the practical superiority of Muon compared to the Orthogonal-SGDM algorithm of Tuddenham et al. (2022) and the importance of weight decay in the training of large-scale language models.
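To make the trust-region view concrete, the sketch below computes one first-order trust-region step, i.e. it minimizes the linear model <G, W' - W> over a ball of a given radius around W, for three choices of norm. With the spectral norm the step direction is the orthogonalized gradient U V^T (where G = U S V^T), with the l-infinity norm it is sign(G), and with the Euclidean (Frobenius) norm it is the normalized gradient. This is an illustration under stated assumptions, not the paper's algorithm: the function names are hypothetical, momentum and stochasticity are omitted (G stands for whatever gradient estimate is used), and the orthogonalization is approximated with the classical cubic Newton-Schulz iteration, chosen here only because it needs no external library.

```python
import math


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius(G):
    return math.sqrt(sum(v * v for row in G for v in row))


def orthogonalize(G, steps=30):
    """Approximate U V^T for G = U S V^T (illustrative cubic Newton-Schulz)."""
    fro = frobenius(G)
    if fro == 0.0:
        return [row[:] for row in G]
    # Dividing by the Frobenius norm puts all singular values in [0, 1],
    # inside the convergence region of the iteration.
    X = [[v / fro for v in row] for row in G]
    for _ in range(steps):
        cubic = matmul(matmul(X, transpose(X)), X)
        # Each nonzero singular value s is mapped to 1.5*s - 0.5*s**3 -> 1.
        X = [[1.5 * x - 0.5 * c for x, c in zip(rx, rc)] for rx, rc in zip(X, cubic)]
    return X


def steepest_direction(G, norm):
    """Unit-norm matrix D that maximizes <G, D> under the chosen norm."""
    if norm == "spectral":
        return orthogonalize(G)  # orthogonalized gradient
    if norm == "linf":
        return [[(v > 0) - (v < 0) for v in row] for row in G]  # entrywise sign
    if norm == "l2":
        fro = frobenius(G)
        return [[v / fro if fro else 0.0 for v in row] for row in G]  # normalized
    raise ValueError("unknown norm: " + norm)


def trust_region_step(W, G, radius, norm):
    """Minimize <G, W_new - W> subject to ||W_new - W|| <= radius."""
    D = steepest_direction(G, norm)
    return [[w - radius * d for w, d in zip(rw, rd)] for rw, rd in zip(W, D)]
```

Calling `trust_region_step(W, G, radius, "spectral")` gives an orthogonalized-gradient update of the kind the abstract associates with Muon, while `"linf"` and `"l2"` give signSGD-style and normalized-SGD-style updates. In all three cases the step length is set by the radius alone and does not depend on the magnitude of G.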