LG · CV · May 23

Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient Spectra

Ben S. Southworth, Shuai Jiang, Daniel McBride, Eric C. Cyr, Stephen Thomas

arXiv:2605.24770 · 11.3

Predicted impact: top 47% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For researchers training ViTs, this work identifies that Muon's effectiveness is recipe-dependent and provides insights into gradient spectral dynamics, though the findings are incremental as they extend existing optimizer analysis to a new domain.

The Muon optimizer outperforms AdamW when training Vision Transformers (ViTs), especially under heavy data augmentation, with large gains on long-tailed datasets such as Pl@ntNet-300K. The study finds that, under a fixed recipe, Muon's gradients in attention projections span a broader spectral basis than AdamW's, while heavy augmentation prevents late-training spectral collapse in deep feedforward blocks.

Muon is a recently developed matrix-aware optimizer that has shown strong results in transformer training, but its behavior in vision transformers (ViTs) is not yet well understood. We study Muon for ViT training, largely on ImageNet-100 and Pl@ntNet-300K, comparing against AdamW under standard vision recipes involving mixup, cutmix, smoothing, and random augmentation and erasing. Muon consistently outperforms AdamW, with especially large gains on long-tailed Pl@ntNet macro top-1. These gains are also recipe-dependent: Muon benefits much more than AdamW from heavy data augmentation. To understand this interaction, we analyze the singular-value structure of matrix gradients throughout the ViT. Within Muon training runs, removing heavy data augmentation induces a late-training spectral concentration and mode collapse in gradient matrices, primarily in deep MLP-down blocks. Under a fixed "full" augmentation recipe, the clearest Muon-AdamW contrast appears instead in QKV gradients, where AdamW gradient energy remains concentrated in a much narrower basis while Muon spreads energy across substantially more singular modes. Muon in ViTs is therefore best understood as an optimizer-recipe interaction. Under a fixed recipe, Muon differs from AdamW most clearly in attention projections, where its gradients span a broader spectral basis. Within Muon, a full training recipe is important for preventing late spectral concentration and mode collapse in deep feedforward blocks. We further demonstrate efficacy in training ViTs on image segmentation and masked autoencoder models, where Muon outperforms AdamW in all settings considered.
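To make "gradient energy concentrated in a narrower basis" concrete, here is a minimal NumPy sketch of how the singular-value spread of one gradient matrix might be summarized. The function name `spectral_summary`, the 90% energy threshold, and the entropy-based effective rank are illustrative assumptions, not necessarily the metrics used by the authors.

```python
import numpy as np


def spectral_summary(grad, energy_fraction=0.9):
    """Summarize how a matrix gradient's energy is spread over singular modes.

    Hypothetical helper: the metrics are common choices, not taken from the paper.
    """
    # Singular values are returned in descending order.
    s = np.linalg.svd(np.asarray(grad, dtype=np.float64), compute_uv=False)
    energy = s ** 2
    total = energy.sum()
    if total == 0.0:
        return {"effective_rank": 0.0, "modes_for_energy": 0, "top1_share": 0.0}

    p = energy / total  # share of squared Frobenius norm per mode

    # Entropy-based effective rank of the energy distribution:
    # 1 for a rank-one matrix, min(m, n) for a perfectly flat spectrum.
    nz = p[p > 0]
    effective_rank = float(np.exp(-(nz * np.log(nz)).sum()))

    # Smallest number of leading modes whose cumulative energy share
    # reaches `energy_fraction`.
    cum = np.cumsum(p)
    modes = min(int(np.searchsorted(cum, energy_fraction)) + 1, len(p))

    return {
        "effective_rank": effective_rank,
        "modes_for_energy": modes,
        "top1_share": float(p[0]),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A rank-one "collapsed" gradient versus a dense random one.
    collapsed = np.outer(rng.standard_normal(64), rng.standard_normal(32))
    spread = rng.standard_normal((64, 32))
    print("collapsed:", spectral_summary(collapsed))
    print("spread:   ", spectral_summary(spread))
```

Tracking such a summary per layer over training steps (for example on the QKV and MLP-down weight gradients) is one way to observe the kind of late-training spectral concentration the abstract describes: the effective rank and mode count fall while the top-mode share rises.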
