TEON: Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training
This work is aimed at researchers and practitioners who pre-train large language models, offering an incremental but principled improvement over the layer-wise Muon optimizer.
The paper proposes TEON, a tensorized generalization of the Muon optimizer that extends gradient orthogonalization beyond individual layers. In large language model pre-training, TEON consistently improves training and validation perplexity across models ranging from 60M to 1B parameters.
The Muon optimizer has demonstrated strong empirical performance in pre-training large language models by performing matrix-level gradient (or momentum) orthogonalization in each layer independently. In this work, we propose TEON, a principled generalization of Muon that extends orthogonalization beyond individual layers by modeling the gradients of a neural network as a structured higher-order tensor. We establish an improved convergence guarantee for TEON over layer-wise Muon and develop a practical instantiation of TEON based on the theoretical analysis, with corresponding ablations. We evaluate our approach on two widely adopted architectures: GPT-style models, ranging from 130M to 774M parameters, and LLaMA-style models, ranging from 60M to 1B parameters. Experimental results show that TEON consistently improves training and validation perplexity across model scales and exhibits strong robustness under various approximate SVD schemes.
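
The difference between layer-wise and cross-layer orthogonalization can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes that the gradients of K same-shaped layers are stacked into an m x n x K tensor and that this tensor is orthogonalized through its mode-1 unfolding (an m x nK matrix); the abstract does not spell out this construction, so treat it as one plausible reading. The function names are hypothetical, and an exact SVD stands in for the approximate schemes (such as Newton-Schulz iterations) that Muon-style optimizers typically use in practice.

```python
import numpy as np

def orthogonalize(M):
    """Replace M by U @ V^T from its reduced SVD (all singular values set to 1)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def layerwise_update(grads):
    """Muon-style baseline: orthogonalize each layer's gradient independently."""
    return [orthogonalize(G) for G in grads]

def stacked_update(grads):
    """Assumed tensorized variant (illustrative only):
    stack K gradients of shape (m, n), orthogonalize the mode-1 unfolding
    of shape (m, n*K), then split the result back into K per-layer updates."""
    m, n = grads[0].shape
    K = len(grads)
    unfolded = np.concatenate(grads, axis=1)   # shape (m, n*K)
    O = orthogonalize(unfolded)                # orthogonalized jointly across layers
    return [O[:, k * n:(k + 1) * n] for k in range(K)]

# Toy gradients for three layers with identical shapes (an assumption of this sketch).
rng = np.random.default_rng(0)
grads = [rng.standard_normal((4, 6)) for _ in range(3)]

layer = layerwise_update(grads)
stacked = stacked_update(grads)
```

In the layer-wise case each update has orthonormal rows on its own. In the stacked case only the concatenation of the per-layer updates has orthonormal rows, so the updates of different layers are coupled through a shared left singular basis.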