Regularization-wise double descent: Why it occurs and how to eliminate it
This work addresses a practical risk-optimization challenge in deep learning, but its contribution is incremental, as it builds on prior work on double descent.
The paper tackles double-descent shaped risk in overparameterized models. It shows that explicitly L2-regularized models can exhibit this behavior as a function of the regularization strength, and that scaling the regularization strength of each part of the model (each layer, in the two-layer network case) can mitigate or eliminate it. Experiments cover linear regression, a two-layer neural network, and CNNs (a 5-layer CNN and ResNet-18).
The risk of overparameterized models, in particular deep neural networks, is often double-descent shaped as a function of the model size. Recently, it was shown that the risk as a function of the early-stopping time can also be double-descent shaped, and this behavior can be explained as a superposition of bias-variance tradeoffs. In this paper, we show that the risk of explicitly L2-regularized models can exhibit double descent behavior as a function of the regularization strength, both in theory and practice. We find that for linear regression, a double-descent shaped risk is caused by a superposition of bias-variance tradeoffs corresponding to different parts of the model and can be mitigated by scaling the regularization strength of each part appropriately. Motivated by this result, we study a two-layer neural network and show that double descent can be eliminated by adjusting the regularization strengths for the first and second layers. Lastly, we study a 5-layer CNN and ResNet-18 trained on CIFAR-10 with label noise, and CIFAR-100 without label noise, and demonstrate that all exhibit double descent behavior as a function of the regularization strength.
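To make the linear regression claim concrete, the following is a minimal sketch, not the paper's experimental setup. It fits ridge regression on synthetic data whose features fall into two parts with very different scales, and computes the test risk over a grid of regularization strengths, once with a single shared strength and once with a per-part scaled penalty. Everything here is an illustrative assumption: the helper names `ridge_fit` and `risk_curve`, the data model, and in particular the choice of scaling each part's penalty by its feature variance, which is one plausible way to align the parts' bias-variance tradeoffs and not necessarily the scaling the paper derives.

```python
import numpy as np

def ridge_fit(X, y, lam, scales=None):
    """Closed-form minimizer of ||X w - y||^2 + lam * sum_j scales[j] * w[j]^2."""
    d = X.shape[1]
    scales = np.ones(d) if scales is None else np.asarray(scales, dtype=float)
    return np.linalg.solve(X.T @ X + lam * np.diag(scales), X.T @ y)

def risk_curve(X, y, X_test, y_test, lams, scales=None):
    """Test mean squared error for each regularization strength in lams."""
    return np.array([
        np.mean((X_test @ ridge_fit(X, y, lam, scales) - y_test) ** 2)
        for lam in lams
    ])

# Hypothetical data model: two feature parts with very different scales.
rng = np.random.default_rng(0)
n_train, n_test, d1, d2 = 50, 2000, 20, 20
feature_std = np.concatenate([np.full(d1, 1.0), np.full(d2, 0.05)])
w_true = rng.normal(size=d1 + d2) / feature_std  # both parts contribute comparably to y
noise_std = 0.5

def sample(m):
    """Draw m examples from the assumed linear model with Gaussian noise."""
    X = rng.normal(size=(m, d1 + d2)) * feature_std
    y = X @ w_true + noise_std * rng.normal(size=m)
    return X, y

X_train, y_train = sample(n_train)
X_test, y_test = sample(n_test)
lams = np.logspace(-6, 3, 40)

# One shared strength: each part is effectively regularized by lam / std^2,
# so the two parts reach their best bias-variance tradeoff at different lam.
risk_uniform = risk_curve(X_train, y_train, X_test, y_test, lams)

# Assumed per-part scaling: penalty proportional to feature variance, which
# equalizes the effective regularization across the two parts.
part_scales = feature_std ** 2
risk_scaled = risk_curve(X_train, y_train, X_test, y_test, lams, part_scales)
```

Plotting `risk_uniform` and `risk_scaled` against `lams` on a logarithmic axis is the natural way to compare the two curves. Whether the shared-strength curve shows two distinct minima depends on the assumed scales, noise level, and sample size, so those values may need adjusting.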