LG · May 9

When and Why Grouping Attention Heads Accelerates Muon Optimization

Hongtao Zhang, Wenjie Zhou, Wei Chen, Xueqi Cheng

arXiv:2605.08933 · 15.4

Predicted impact: top 18% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For practitioners training transformer models, this work provides a practical optimizer hyperparameter (head grouping) that improves convergence over existing Muon variants.

Muon optimization for attention projections suffers from a granularity mismatch between full-matrix and per-head updates. The authors propose Group Muon, which groups attention heads to balance whitening gain and norm cost, achieving improved validation loss on GPT-2 Small trained on FineWeb compared to both full-matrix and per-head Muon.

Muon orthogonalizes matrix updates, but multi-head attention naturally operates at the level of heads. This granularity mismatch raises the question of whether Muon should be applied to the full attention projection, to individual heads, or to intermediate head groups. We study this question through a one-step descent comparison between full-matrix Muon and group-wise Muon. Our analysis reveals a trade-off between the group-wise whitening gain from group-wise updates and the grouping-induced norm cost, an additional update-norm cost caused by replacing full-matrix whitening with group-wise whitening. Motivated by this trade-off, we propose Group Muon, which treats head group size and grouping rule as optimizer hyperparameters. On GPT-2 Small trained on FineWeb, appropriate grouping improves validation loss over both full-QKV Muon and fully head-wise MuonSplit.
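
The three granularities the abstract compares (full matrix, per head, head groups) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes heads occupy contiguous column blocks of the projection update, uses contiguous grouping as the grouping rule, and substitutes an exact SVD for the iterative orthogonalization that Muon implementations typically use. The names `orthogonalize`, `group_orthogonalize`, `n_heads`, and `group_size` are hypothetical.

```python
import numpy as np


def orthogonalize(update):
    """Map a matrix to U @ Vt from its thin SVD (all singular values set to 1).

    Assumption: exact SVD stands in for the approximate iteration used in practice.
    """
    u, _, vt = np.linalg.svd(update, full_matrices=False)
    return u @ vt


def group_orthogonalize(update, n_heads, group_size):
    """Orthogonalize each contiguous group of `group_size` heads separately.

    Assumption: `update` has shape (d_model, n_heads * head_dim) and each head
    owns a contiguous block of columns.
    """
    _, d_out = update.shape
    if d_out % n_heads != 0:
        raise ValueError("output width must be divisible by n_heads")
    if n_heads % group_size != 0:
        raise ValueError("n_heads must be divisible by group_size")
    head_dim = d_out // n_heads
    width = group_size * head_dim  # number of columns per head group
    blocks = [
        orthogonalize(update[:, start:start + width])
        for start in range(0, d_out, width)
    ]
    return np.concatenate(blocks, axis=1)


rng = np.random.default_rng(0)
grad = rng.standard_normal((64, 64))  # toy update: 4 heads of dimension 16

full = group_orthogonalize(grad, n_heads=4, group_size=4)      # full-matrix case
grouped = group_orthogonalize(grad, n_heads=4, group_size=2)   # two groups of 2 heads
per_head = group_orthogonalize(grad, n_heads=4, group_size=1)  # fully head-wise case

for name, upd in [("full", full), ("grouped", grouped), ("per_head", per_head)]:
    # The spectral norm of the concatenated update is 1 only in the full-matrix case.
    print(name, np.linalg.norm(upd, 2))
```

Setting `group_size` equal to `n_heads` recovers the full-matrix update and `group_size=1` the per-head one, so in this sketch the group size is a single knob that interpolates between the two extremes. The printed spectral norms only show that group-wise updates can have a larger overall norm than the full-matrix update; whether this matches the paper's definition of the grouping-induced norm cost is not established here.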
