A Unified Noise-Curvature View of Loss of Trainability
This work addresses a critical issue in continual learning for AI systems that must adapt to evolving tasks, though the contribution is incremental because it builds on existing optimization analysis.
The paper tackles loss of trainability in continual learning, where gradient steps stop improving accuracy as tasks evolve. It introduces a per-layer predictive threshold that combines a gradient-noise bound with a curvature-volatility bound, and uses that threshold to stabilize training and improve accuracy across methods such as CReLU, Wasserstein regularization, and L2 weight decay.
Loss of trainability (LoT) in continual learning occurs when gradient steps no longer yield improvement as tasks evolve, so accuracy stalls or degrades despite adequate capacity and supervision. We analyze LoT incurred with Adam through an optimization lens and find that single indicators such as Hessian rank, sharpness level, weight or gradient norms, gradient-to-parameter ratios, and unit-sign entropy are not reliable predictors. Instead, we introduce two complementary criteria: a batch-size-aware gradient-noise bound and a curvature volatility-controlled bound that combine into a per-layer predictive threshold that anticipates trainability behavior. Using this threshold, we build a simple per-layer scheduler that keeps each layer's effective step below a safe limit, stabilizing training and improving accuracy across concatenated ReLU (CReLU), Wasserstein regularization, and L2 weight decay, with learned learning-rate trajectories that mirror canonical decay.
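The per-layer scheduler idea can be illustrated with a minimal sketch. The abstract does not give the formulas for the two bounds or say how they are combined, so everything here is an assumption made for illustration: the names `LayerStats`, `safe_limit`, and `schedule_layer_lrs` are hypothetical, the bound values are taken as precomputed inputs, the threshold is assumed to be the tighter of the two bounds, and the learning rate stands in for the layer's effective step.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LayerStats:
    """Precomputed per-layer step-size bounds (hypothetical inputs)."""
    name: str
    noise_bound: float      # assumed: limit implied by batch-size-aware gradient noise
    curvature_bound: float  # assumed: limit implied by curvature volatility


def safe_limit(stats: LayerStats) -> float:
    # Assumption: combine the two criteria by taking the tighter bound.
    return min(stats.noise_bound, stats.curvature_bound)


def schedule_layer_lrs(
    current_lrs: Dict[str, float],
    layer_stats: List[LayerStats],
    shrink: float = 1.0,
) -> Dict[str, float]:
    """Cap each layer's learning rate at shrink * safe_limit for that layer."""
    new_lrs = {}
    for stats in layer_stats:
        limit = shrink * safe_limit(stats)
        # Never raise the rate here; only pull it down to the safe limit.
        new_lrs[stats.name] = min(current_lrs[stats.name], limit)
    return new_lrs


if __name__ == "__main__":
    stats = [
        LayerStats("fc1", noise_bound=0.01, curvature_bound=0.002),
        LayerStats("fc2", noise_bound=0.5, curvature_bound=0.3),
    ]
    print(schedule_layer_lrs({"fc1": 0.005, "fc2": 0.005}, stats))
```

In this sketch only `fc1` is capped, because its assumed curvature bound is below the current rate, while `fc2` is left unchanged. In a PyTorch-style setup, the returned values could be written into the `lr` entry of per-layer optimizer parameter groups before each step.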