LG | ML | Oct 10, 2025

An Exploration of Non-Euclidean Gradient Descent: Muon and its Many Variants

arXiv:2510.09827v1 | 16 citations | h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the challenge of hyperparameter tuning when optimizing neural networks, offering more robust methods that can reduce computational cost. It is incremental in that it builds on existing optimizers such as Muon and Adam.

The paper systematically explores non-Euclidean gradient descent methods for neural networks, formalizing existing combinations of Adam and Muon and deriving new variants such as MuonMax, which is found to be significantly more robust than Muon to the choice of learning rate. Combining these methods with model-based momentum (Momo) further improves robustness to hyperparameter tuning and often yields a better validation score.

To define a steepest descent method over a neural network, we need to choose a norm for each layer, a way to aggregate these norms across layers, and whether to use normalization. We systematically explore different alternatives for aggregating norms across layers, both formalizing existing combinations of Adam and the recently proposed Muon as a type of non-Euclidean gradient descent, and deriving new variants of the Muon optimizer. Through a comprehensive experimental evaluation of the optimizers within our framework, we find that Muon is sensitive to the choice of learning rate, whereas a new variant we call MuonMax is significantly more robust. We then show how to combine any non-Euclidean gradient method with model-based momentum (known as Momo). The new Momo variants of Muon are significantly more robust to hyperparameter tuning, and often achieve a better validation score. Thus, for new tasks where the optimal hyperparameters are not known, we advocate for using Momo in combination with MuonMax to save on costly hyperparameter tuning.
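
The three choices named in the abstract (a per-layer norm, an aggregation across layers, and whether to normalize) can be illustrated with a minimal NumPy sketch. It assumes the spectral norm on every layer, as in Muon-style updates, and shows only two generic aggregations (max and l2). It is not the paper's MuonMax or Momo update, and the names `spectral_lmo`, `steepest_descent_step`, `aggregate`, and `normalize` are hypothetical, chosen for illustration. An exact SVD is used here for clarity.

```python
import numpy as np

def spectral_lmo(grad):
    """Return the unit-spectral-norm direction U @ V^T that maximizes <grad, d>,
    together with the nuclear norm of grad (the dual of the spectral norm)."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt, s.sum()

def steepest_descent_step(params, grads, lr, aggregate="max", normalize=True):
    """One steepest descent step with the spectral norm on each layer.

    aggregate: "max" or "l2", how per-layer norms are combined (illustrative choices).
    normalize: if True the step has aggregated norm lr; otherwise it is
               additionally scaled by the dual norm of the full gradient.
    """
    dirs, duals = zip(*(spectral_lmo(g) for g in grads))
    duals = np.array(duals)
    if aggregate == "max":
        # Unit ball is a product of per-layer balls: every layer moves at full size.
        weights = np.ones(len(grads))
        total_dual = duals.sum()
    elif aggregate == "l2":
        # Layers share a unit l2 budget, split in proportion to their dual norms.
        total_dual = np.sqrt((duals ** 2).sum())
        weights = duals / total_dual if total_dual > 0 else np.zeros_like(duals)
    else:
        raise ValueError(f"unknown aggregate: {aggregate}")
    scale = lr if normalize else lr * total_dual
    return [p - scale * w * d for p, w, d in zip(params, weights, dirs)]
```

With `aggregate="max"` and `normalize=True`, every layer's update has spectral norm exactly `lr`, which is the sense in which the learning rate directly controls the step size in the chosen norm.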
