Gradient-based bilevel optimization for multi-penalty Ridge regression through matrix differential calculus
This work aims to improve generalization in linear regression by giving practitioners more flexible regularization. The contribution is incremental, as it builds on existing Ridge regression and optimization techniques.
The paper tackles the limitation of single-hyperparameter regularization in linear regression by developing a multi-hyperparameter Ridge regression method, in which each input variable has its own regularization parameter and all parameters are optimized via gradient-based bilevel optimization. Numerical results show that it outperforms LASSO, Ridge, and Elastic Net regression, and that computing the gradient analytically takes less computational time than automatic differentiation when the number of input variables is large.
Common regularization algorithms for linear regression, such as LASSO and Ridge regression, rely on a regularization hyperparameter that balances the tradeoff between minimizing the fitting error and the norm of the learned model coefficients. As this hyperparameter is scalar, it can be easily selected via random or grid search optimizing a cross-validation criterion. However, using a scalar hyperparameter limits the algorithm's flexibility and potential for better generalization. In this paper, we address the problem of linear regression with l2-regularization, where a different regularization hyperparameter is associated with each input variable. We optimize these hyperparameters using a gradient-based approach, wherein the gradient of a cross-validation criterion with respect to the regularization hyperparameters is computed analytically through matrix differential calculus. Additionally, we introduce two strategies, tailored to sparse model learning problems, that aim to reduce the risk of overfitting to the validation data. Numerical examples demonstrate that our multi-hyperparameter regularization approach outperforms LASSO, Ridge, and Elastic Net regression. Moreover, the analytical computation of the gradient proves to be more efficient in terms of computational time than automatic differentiation, especially when handling a large number of input variables. Application to the identification of over-parameterized Linear Parameter-Varying models is also presented.
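To make the idea of an analytically computed hypergradient concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a single hold-out validation split rather than a full cross-validation criterion, a squared-error validation loss, and hypothetical function names (`ridge_fit`, `validation_loss_and_grad`). With A = X_tr^T X_tr + diag(lam) and beta = A^{-1} X_tr^T y_tr, differentiating gives d beta / d lam_i = -A^{-1} e_i beta_i, so the gradient of the validation loss reduces to an elementwise product that needs only one extra linear solve.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Multi-penalty Ridge: solve (X^T X + diag(lam)) beta = X^T y."""
    A = X.T @ X + np.diag(lam)
    beta = np.linalg.solve(A, X.T @ y)
    return beta, A


def validation_loss_and_grad(lam, X_tr, y_tr, X_val, y_val):
    """Hold-out loss 0.5 * ||y_val - X_val beta(lam)||^2 and its gradient in lam.

    Assumption (illustrative only): one train/validation split, not K-fold CV.
    """
    beta, A = ridge_fit(X_tr, y_tr, lam)
    r = y_val - X_val @ beta              # validation residual
    loss = 0.5 * r @ r
    # dE/dbeta = -X_val^T r and dbeta/dlam_i = -A^{-1} e_i beta_i, hence
    # dE/dlam_i = (A^{-1} X_val^T r)_i * beta_i  (A is symmetric).
    g = np.linalg.solve(A, X_val.T @ r)
    grad = g * beta
    return loss, grad


# Small synthetic demo with a sparse ground-truth coefficient vector.
rng = np.random.default_rng(0)
n_tr, n_val, p = 40, 30, 5
X_tr = rng.normal(size=(n_tr, p))
X_val = rng.normal(size=(n_val, p))
beta_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y_tr = X_tr @ beta_true + 0.1 * rng.normal(size=n_tr)
y_val = X_val @ beta_true + 0.1 * rng.normal(size=n_val)

lam = np.full(p, 1.0)                     # one penalty per input variable
loss, grad = validation_loss_and_grad(lam, X_tr, y_tr, X_val, y_val)
print("validation loss:", loss)
print("gradient w.r.t. lam:", grad)
```

For a K-fold criterion, the same expression would be evaluated on each fold and summed. The resulting gradient can be passed to any first-order optimizer, provided the penalties are kept non-negative, for example by optimizing log(lam) instead of lam.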