LGMLApr 6, 2025

From Continual Learning to SGD and Back: Better Rates for Continual Linear Models

arXiv:2504.04579v27 citationsh-index: 74
Originality Incremental advance
AI Analysis

This work addresses catastrophic forgetting in continual learning for machine learning practitioners, offering theoretical improvements and new insights into randomization effects, though it is incremental as it builds on existing continual linear model frameworks.

The paper tackles the problem of catastrophic forgetting in continual learning by analyzing forgetting rates for overparameterized linear models, proving that fitting a task is equivalent to a single SGD step and establishing improved universal rates, such as reducing the best existing rate from O((d-r)/k) to O(min(k^{-1/4}, sqrt(d-r)/k, sqrt(Tr)/k)) for regression with replacement and providing the first rate for random orderings without replacement.

We theoretically study the common continual learning setup where an overparameterized model is sequentially fitted to a set of jointly realizable tasks. We analyze the forgetting, i.e., loss on previously seen tasks, after $k$ iterations. For continual linear models, we prove that fitting a task is equivalent to a single stochastic gradient descent (SGD) step on a modified objective. We develop novel last-iterate SGD upper bounds in the realizable least squares setup, which we then leverage to derive new results for continual learning. Focusing on random orderings over $T$ tasks, we establish universal forgetting rates, whereas existing rates depend on the problem dimensionality or complexity. Specifically, in continual regression with replacement, we improve the best existing rate from $O((d-r)/k)$ to $O(\min(k^{-1/4}, \sqrt{d-r}/k, \sqrt{Tr}/k))$, where $d$ is the dimensionality and $r$ the average task rank. Furthermore, we establish the first rate for random task orderings without replacement. The obtained rate of $O(\min(T^{-1/4}, (d-r)/T))$ proves for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences. Finally, we prove a matching $O(k^{-1/4})$ forgetting rate for continual linear classification on separable data. Our universal rates apply for broader projection methods, such as block Kaczmarz and POCS, illuminating their loss convergence under i.i.d. and one-pass orderings.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes