Optimal L2 Regularization in High-dimensional Continual Linear Regression
This work offers theoretical insight and a practical recipe for designing continual learning systems, addressing a known bottleneck in choosing the regularization strength across multiple tasks, although it is an incremental extension of prior theoretical analyses.
The paper studies generalization in overparameterized continual linear regression with L2 regularization. It derives a closed-form expression for the expected generalization loss and proves that the optimal fixed regularization strength scales as $T/\ln T$ in the number of tasks $T$. Experiments on linear regression and neural networks validate these findings.
We study generalization in an overparameterized continual linear regression setting, where a model is trained with L2 (isotropic) regularization across a sequence of tasks. We derive a closed-form expression for the expected generalization loss in the high-dimensional regime that holds for arbitrary linear teachers. We demonstrate that isotropic regularization mitigates label noise under both single-teacher and multiple i.i.d. teacher settings, whereas prior work accommodating multiple teachers either did not employ regularization or used memory-demanding methods. Furthermore, we prove that the optimal fixed regularization strength scales nearly linearly with the number of tasks $T$, specifically as $T/\ln T$. To our knowledge, this is the first such result in theoretical continual learning. Finally, we validate our theoretical findings through experiments on linear regression and neural networks, illustrating how this scaling law affects generalization and offering a practical recipe for the design of continual learning systems.
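To make the setting concrete, the following is a minimal sketch of sequential L2-regularized regression, not the paper's exact formulation. It assumes the penalty at each task pulls the weights toward the previous iterate, i.e. $w_t = \arg\min_w \|X_t w - y_t\|^2 + \lambda \|w - w_{t-1}\|^2$, with isotropic Gaussian inputs and a single noisy teacher. The names `continual_ridge_step`, `run_sequence`, and `lam_scale`, as well as the dimensions and noise level, are hypothetical choices made for illustration. The abstract gives only the scaling $T/\ln T$, so the constant in front of it is left as a free parameter.

```python
import numpy as np


def continual_ridge_step(w_prev, X, y, lam):
    """Assumed per-task update: argmin_w ||X w - y||^2 + lam * ||w - w_prev||^2."""
    n = X.shape[0]
    residual = y - X @ w_prev
    # Dual (n x n) form of the ridge solution; cheaper than the d x d
    # primal solve in the overparameterized regime d >> n.
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), residual)
    return w_prev + X.T @ alpha


def run_sequence(T, d=200, n=20, noise_std=0.5, lam_scale=1.0, seed=0):
    """Train on T tasks with a fixed lam = lam_scale * T / ln(T).

    Returns the squared distance to the teacher, which equals the excess
    risk when test inputs are standard Gaussian (an assumption of this sketch).
    """
    if T < 2:
        raise ValueError("T must be at least 2 so that ln(T) > 0")
    rng = np.random.default_rng(seed)
    teacher = rng.standard_normal(d) / np.sqrt(d)  # single shared linear teacher
    lam = lam_scale * T / np.log(T)                # scaling taken from the abstract
    w = np.zeros(d)
    for _ in range(T):
        X = rng.standard_normal((n, d))            # n < d: overparameterized task
        y = X @ teacher + noise_std * rng.standard_normal(n)  # noisy labels
        w = continual_ridge_step(w, X, y, lam)
    return float(np.sum((w - teacher) ** 2))
```

One way to use the sketch is to call `run_sequence` over a grid of `lam_scale` values for several `T` and check whether the best-performing $\lambda$ grows roughly like $T/\ln T$. The sketch does not guarantee any particular outcome at small `d`, since the paper's result is stated for the high-dimensional regime.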