Can Muon Fine-tune Adam-Pretrained Models?
This work addresses the optimizer mismatch that arises when widely used Adam-pretrained models are fine-tuned with Muon, and offers LoRA as a mitigation strategy.
Muon, an efficient alternative to Adam for pretraining, underperforms when naively used for fine-tuning Adam-pretrained models due to optimizer mismatch. Using LoRA reduces the performance gap between Adam and Muon across language and vision tasks.
Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption scales with update strength. This leads us to hypothesize that constraining updates should mitigate the mismatch. We validate this with LoRA: across language and vision tasks, LoRA reduces the performance gap between Adam and Muon observed under full fine-tuning. Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is available at https://github.com/XingyuQu/muon-finetune.
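To make the notion of "update strength" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes an idealized Muon-style direction (exact SVD orthogonalization in place of the Newton-Schulz iteration used in practice), an idealized Adam-style direction (the elementwise sign, which is what Adam reduces to without moment averaging), and a randomly initialized LoRA delta. The function names, matrix sizes, and rank are illustrative assumptions and do not come from the abstract.

```python
import numpy as np


def muon_like_direction(m):
    """Idealized Muon-style direction (assumption: exact SVD, not Newton-Schulz).

    Replaces every singular value of m with 1, giving U @ V^T.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def adam_like_direction(g):
    """Idealized Adam-style direction (assumption: no moment averaging, eps = 0).

    In that limit Adam's elementwise normalization reduces to sign(g).
    """
    return np.sign(g)


rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 32, 4  # hypothetical layer shape and LoRA rank

grad = rng.standard_normal((d_out, d_in))  # stand-in for a weight gradient

muon_step = muon_like_direction(grad)
adam_step = adam_like_direction(grad)

# LoRA parameterizes the weight change as B @ A, so its rank is at most `rank`.
B = rng.standard_normal((d_out, rank))
A = rng.standard_normal((rank, d_in))
lora_delta = B @ A

muon_singular_values = np.linalg.svd(muon_step, compute_uv=False)
muon_rank = np.linalg.matrix_rank(muon_step)
lora_rank = np.linalg.matrix_rank(lora_delta)

print("Muon-like step: rank", muon_rank,
      "| singular values all close to 1:",
      bool(np.allclose(muon_singular_values, 1.0)))
print("Adam-like step: entries all of magnitude 1:",
      bool(np.all(np.abs(adam_step) == 1.0)))
print("LoRA delta: rank", lora_rank, "out of a possible", min(d_out, d_in))
```

Under these assumptions, the full fine-tuning Muon-like step moves the weight matrix equally along every singular direction, whereas the LoRA weight change is confined to a rank-4 subspace regardless of which optimizer trains the factors. This is one way to picture how LoRA constrains updates; it does not reproduce the paper's experiments.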