Changing the Kernel During Training Leads to Double Descent in Kernel Regression
This work addresses the need for better generalization and efficiency in kernel methods and neural networks by offering a training strategy that circumvents model selection.
The paper proposes decreasing the kernel bandwidth while training kernel regression with gradient descent, which leads to double descent behavior and benign overfitting. The approach outperforms constant-bandwidth kernel regression on real and synthetic data and, when applied to neural networks, reduces the number of iterations needed to converge.
We investigate changing the bandwidth of a translation-invariant kernel during training when solving kernel regression with gradient descent. We present a theoretical bound on the out-of-sample generalization error that advocates for decreasing the bandwidth (and thus increasing the model complexity) during training. We further use the bound to show that kernel regression exhibits double descent behavior when the model complexity is expressed as the minimum allowed bandwidth during training. Decreasing the bandwidth all the way to zero results in benign overfitting and also circumvents the need for model selection. We demonstrate the double descent behavior on real and synthetic data, and show that kernel regression with a decreasing bandwidth outperforms kernel regression with a constant bandwidth selected by cross-validation or marginal likelihood maximization. Finally, we apply our findings to neural networks: by modifying the neural tangent kernel (NTK) during training so that it behaves as if its bandwidth were decreasing to zero, we can make the network overfit more benignly and converge in fewer iterations.
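To make the core idea concrete, the following is a minimal sketch of kernel gradient descent in which the bandwidth shrinks as training proceeds. It is not the paper's algorithm: the Gaussian kernel, the geometric decay schedule, the step size, and all names (`gaussian_kernel`, `kernel_gd_decreasing_bandwidth`, `sigma_start`, `sigma_min`, `decay`, `lr`) are assumptions chosen for illustration, and the toy data is synthetic.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Gaussian (translation-invariant) kernel matrix between 1-D point sets."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / bandwidth) ** 2)


def kernel_gd_decreasing_bandwidth(x_train, y_train, x_test,
                                   sigma_start=2.0, sigma_min=0.05,
                                   decay=0.98, lr=0.5, n_steps=500):
    """Kernel gradient descent on squared loss with a shrinking bandwidth.

    Assumed (hypothetical) schedule: sigma is multiplied by `decay` after
    every step and clipped from below at `sigma_min`.
    """
    n = len(x_train)
    f_train = np.zeros(n)            # current predictions on training inputs
    f_test = np.zeros(len(x_test))   # current predictions on test inputs
    sigma = sigma_start
    train_mse = []                   # training error recorded before each step
    for _ in range(n_steps):
        residual = y_train - f_train
        train_mse.append(float(np.mean(residual ** 2)))
        # Functional gradient step using the kernel at the current bandwidth.
        f_train = f_train + (lr / n) * gaussian_kernel(x_train, x_train, sigma) @ residual
        f_test = f_test + (lr / n) * gaussian_kernel(x_test, x_train, sigma) @ residual
        # Decrease the bandwidth, i.e. increase the model complexity.
        sigma = max(sigma_min, decay * sigma)
    return f_test, f_train, train_mse, sigma


# Toy usage on synthetic data (assumed setup, not from the paper).
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1.0, 1.0, size=20))
y_train = np.sin(3.0 * x_train) + 0.1 * rng.standard_normal(20)
x_test = np.linspace(-1.0, 1.0, 50)
f_test, f_train, train_mse, sigma_final = kernel_gd_decreasing_bandwidth(
    x_train, y_train, x_test
)
```

In this sketch, `sigma_min` plays the role of the minimum allowed bandwidth that the abstract uses as the measure of model complexity. Because the Gaussian kernel matrix is positive semidefinite with eigenvalues at most `n`, a step size `lr` of at most 1 keeps the training residual from growing at any bandwidth, so the schedule can be changed freely without destabilizing the iteration.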