LG AI Feb 26

Residual Koopman Spectral Profiling for Predicting and Preventing Transformer Training Instability

arXiv:2602.22988v1
h-index: 20
Originality: Highly original
AI Analysis

This work matters to machine learning practitioners and researchers because it offers a way to predict and prevent transformer training instability, which saves computational resources and enables more aggressive training schedules.

This paper addresses the problem of transformer training instability, which leads to wasted computational resources. The authors developed Residual Koopman Spectral Profiling (RKSP), a method that predicts training divergence with an AUROC of 0.995 from a single forward pass at initialization. They also introduced Koopman Spectral Shaping (KSS), which reduces divergence rates from 66.7% to 12.5% and enables 50% to 150% higher learning rates in challenging regimes.

Training divergence in transformers wastes compute, yet practitioners discover instability only after expensive runs begin. They therefore need an estimate of a transformer's probability of failure before training starts. Residual Koopman Spectral Profiling (RKSP) provides such an estimate. From a single forward pass at initialization, RKSP extracts Koopman spectral features by applying whitened dynamic mode decomposition to layer-wise residual snapshots. Our central diagnostic, the near-unit spectral mass, quantifies the fraction of modes concentrated near the unit circle, which captures instability risk. For predicting divergence across a wide range of configurations, this estimator achieves an AUROC of 0.995, outperforming the best gradient baseline. We further make this diagnostic actionable through Koopman Spectral Shaping (KSS), which reshapes spectra during training. We validate both components empirically: RKSP predicts divergence at initialization, and when RKSP flags high risk, turning on KSS prevents divergence. In the challenging high learning rate regime without normalization layers, KSS reduces the divergence rate from 66.7% to 12.5% and enables learning rates that are 50% to 150% higher. These findings generalize to WikiText-103 language modeling, vision transformers on CIFAR-10, and pretrained language models, including GPT-2 and LLaMA-2 up to 7B, as well as emerging architectures such as MoE, Mamba-style SSMs, and KAN.
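The abstract does not spell out the computation, so the following is only a rough sketch of what a DMD-based near-unit spectral mass could look like, not the authors' implementation. It assumes the residual-stream states of every layer are available as one array of shape (layers + 1, tokens, width), treats each layer as one step of a dynamical system, and uses the standard SVD-projected (exact) DMD as a stand-in for the paper's whitened variant. The function names `dmd_eigenvalues` and `near_unit_mass`, the tolerance `eps`, and the rank cutoff are all hypothetical choices made for illustration.

```python
import numpy as np


def dmd_eigenvalues(hidden_states, rank=None, tol=1e-10):
    """Estimate eigenvalues of the layer-to-layer map via SVD-projected DMD.

    hidden_states: array of shape (L + 1, n_tokens, d), the residual stream
    before the first layer and after each of the L layers (assumed layout).
    """
    H = np.asarray(hidden_states, dtype=np.float64)
    d = H.shape[-1]
    # Column k of X is a token state at layer l; column k of Y is the same
    # token at layer l + 1.
    X = H[:-1].reshape(-1, d).T
    Y = H[1:].reshape(-1, d).T

    # The SVD of X supplies coordinates in which the inputs are decorrelated
    # and unit-scaled (the whitening step in this sketch).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > tol * s[0]
    if rank is not None:
        keep[rank:] = False
    U, s, Vt = U[:, keep], s[keep], Vt[keep]

    # Reduced operator K = U^T Y V S^{-1}; its eigenvalues approximate the
    # spectrum of the best-fit linear map from layer l to layer l + 1.
    K = (U.T @ Y @ Vt.T) / s
    return np.linalg.eigvals(K)


def near_unit_mass(eigenvalues, eps=0.05):
    """Fraction of eigenvalues whose modulus lies within eps of 1."""
    moduli = np.abs(np.asarray(eigenvalues))
    return float(np.mean(np.abs(moduli - 1.0) < eps))


if __name__ == "__main__":
    # Synthetic stand-in for a forward pass: random states, not real data.
    rng = np.random.default_rng(0)
    states = rng.standard_normal((7, 32, 8))
    eigs = dmd_eigenvalues(states)
    print("near-unit spectral mass:", near_unit_mass(eigs))
```

In this reading, a large mass means many modes are neither damped nor amplified from layer to layer, which is the quantity the abstract associates with instability risk. How the paper thresholds it, or combines it with other spectral features, is not stated here.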

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.