Predicting Plasticity in Deep Continual Learning: A Theoretical Perspective
For researchers in deep continual learning, this work provides a more reliable metric to predict plasticity loss, though it is an incremental improvement over existing diagnostics.
The paper shows that existing plasticity diagnostics (e.g., representation rank, NTK rank) can fail to predict trainability in continual learning, and proposes a new metric called optimization readiness that theoretically lower-bounds one-step optimization gain and empirically outperforms prior diagnostics in ranking checkpoints by trainability.
Deep continual learning requires models to adapt to new tasks without retraining from scratch. However, neural networks can lose this ability after training on previous tasks, a phenomenon known as loss of plasticity. Several explanations and diagnostics have been proposed for plasticity loss. Motivated by the philosophy that "all models are wrong, but some are useful", we ask: can existing diagnostics predict a neural network's plasticity? In this work, we take a practical view and interpret plasticity as trainability, i.e., a neural network's future optimization gain on a target task. We first take a theoretical approach and construct counterexamples showing that some widely adopted diagnostics of plasticity, including representation rank and neural tangent kernel (NTK) rank, can fail to predict the loss of trainability in both regression and classification settings. We instead propose a novel metric, called optimization readiness, which combines gradient strength and gradient reliability. We prove that optimization readiness lower-bounds one-step optimization gain under standard smoothness assumptions, providing a theoretical guarantee for its predictive power. Empirically, we show that across commonly used deep continual learning settings, such as Slowly-Changing Regression and Permuted MNIST, optimization readiness ranks checkpoints by trainability more reliably than prior diagnostics, even with substantially fewer samples.
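To make the phrase "lower-bounds one-step optimization gain under standard smoothness assumptions" concrete, the sketch below checks the textbook descent lemma on a toy least-squares problem: for an L-smooth loss, one gradient step of size eta reduces the loss by at least eta * (1 - L * eta / 2) * ||g||^2. This is not the paper's optimization readiness metric, whose exact definition is not given in this abstract. It illustrates only the gradient-strength side with a full-batch gradient and says nothing about gradient reliability. All function names, the toy data, and the step size are assumptions made for illustration.

```python
# Illustrative sketch only: compares the actual one-step gradient descent gain
# with the standard smoothness (descent lemma) lower bound on a toy problem.
# Names and data here are hypothetical, not taken from the paper.

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(w, xs, ys):
    """Least-squares loss: (1 / 2n) * sum_i (w . x_i - y_i)^2."""
    return sum(0.5 * (dot(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w, xs, ys):
    """Full-batch gradient of the least-squares loss with respect to w."""
    n = len(xs)
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        residual = dot(w, x) - y
        for j, xj in enumerate(x):
            g[j] += residual * xj / n
    return g

def smoothness_upper_bound(xs):
    """Upper bound on the smoothness constant L.

    The Hessian is (1/n) * sum_i x_i x_i^T; its trace upper-bounds its
    largest eigenvalue, so the trace is a valid (possibly loose) L.
    """
    return sum(dot(x, x) for x in xs) / len(xs)

def one_step_gain_and_bound(w, xs, ys, eta):
    """Return (actual one-step gain, descent-lemma lower bound)."""
    g = grad(w, xs, ys)
    w_next = [wj - eta * gj for wj, gj in zip(w, g)]
    gain = loss(w, xs, ys) - loss(w_next, xs, ys)
    L = smoothness_upper_bound(xs)
    bound = eta * (1.0 - 0.5 * L * eta) * dot(g, g)
    return gain, bound

# Toy regression data (hypothetical).
xs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
ys = [1.0, -2.0, 0.5]
w0 = [0.0, 0.0]
eta = 1.0 / smoothness_upper_bound(xs)  # step size 1/L keeps the bound positive

gain, bound = one_step_gain_and_bound(w0, xs, ys, eta)
print(f"one-step gain = {gain:.4f}, lower bound = {bound:.4f}")
```

The bound depends only on quantities available at the current checkpoint (the gradient norm and a smoothness constant), which suggests why a gradient-based diagnostic can come with a guarantee on one-step gain in a way that a rank-based diagnostic does not. A stochastic setting would additionally need to account for how well a minibatch gradient agrees with the full gradient, which is presumably where a reliability term enters.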