Early-stopping for Transformer model training
This work addresses a practical concern for machine learning practitioners: training Transformer models efficiently and deciding when to stop on principled grounds. The contribution is incremental, since it builds on existing early-stopping methods and adds a new theoretical approach.
The paper addresses the problem of when to stop training Transformer models. It develops a theoretical framework based on Random Matrix Theory to analyze training dynamics and derives two validation-free early-stopping criteria that align strongly with observed spectral changes.
This work introduces a novel theoretical framework grounded in Random Matrix Theory (RMT) for analyzing Transformer training dynamics. We focus on the underlying mechanisms that drive performance improvements and derive principled early-stopping criteria. Empirically, we observe that the spectral density of the shallow self-attention matrix V consistently evolves into a heavy-tailed distribution. Using a power-law (PL) fit to this matrix as a probe, we divide training into three stages: structural exploration, heavy-tailed structure stabilization, and convergence saturation. This staging provides guidance for preliminary stopping decisions. Crucially, we propose two consistent and validation-free criteria: a quantitative metric for heavy-tailed dynamics and a novel spectral signature indicative of convergence. The strong alignment between these criteria highlights the utility of RMT for monitoring and diagnosing the progression of Transformer model training.
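As a rough illustration of the power-law probe described above, the following Python sketch estimates a tail exponent from the eigenvalue spectrum of a weight matrix. It is not the paper's procedure. The function names `esd` and `power_law_alpha`, the normalization by the row count, the fixed `tail_fraction` cutoff, and the choice of the continuous maximum-likelihood (Hill-type) estimator are all assumptions made for illustration. The paper's two stopping criteria are not reproduced here.

```python
import numpy as np


def esd(weight: np.ndarray) -> np.ndarray:
    """Eigenvalues of W^T W / N, where N is the number of rows of W.

    Assumption: this normalization is a common convention, not one
    confirmed by the source.
    """
    n_rows = weight.shape[0]
    # Squared singular values of W are the eigenvalues of W^T W.
    singular_values = np.linalg.svd(weight, compute_uv=False)
    return singular_values ** 2 / n_rows


def power_law_alpha(eigenvalues: np.ndarray, tail_fraction: float = 0.5) -> float:
    """Estimate alpha in p(x) ~ x^(-alpha) on the upper tail of the spectrum.

    Uses the continuous power-law MLE: alpha = 1 + k / sum(log(x_i / x_min)).
    Assumption: x_min is set by a fixed tail fraction (hypothetical choice);
    a more careful fit would select x_min from the data.
    """
    lam = np.sort(eigenvalues[eigenvalues > 0])
    if lam.size < 2:
        raise ValueError("need at least two positive eigenvalues")
    k = max(2, int(lam.size * tail_fraction))
    tail = lam[-k:]          # the k largest eigenvalues
    x_min = tail[0]          # smallest value in the tail
    log_sum = np.sum(np.log(tail / x_min))
    if log_sum == 0.0:
        raise ValueError("tail is degenerate; cannot fit a power law")
    return 1.0 + k / log_sum


if __name__ == "__main__":
    # Hypothetical stand-in for a value-projection matrix from a checkpoint.
    rng = np.random.default_rng(0)
    weight = rng.standard_normal((256, 64))
    print("alpha estimate:", power_law_alpha(esd(weight)))
```

In use, one would call `power_law_alpha(esd(weight))` on the same matrix at successive checkpoints and watch how the exponent changes over training. Any thresholds for separating the three stages would have to come from the paper itself.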