ML LG PR Apr 27, 2025

Test Set Sizing for the Ridge Regression

arXiv:2504.19231v2
Originality: Synthesis-oriented
AI Analysis

This paper provides a theoretical foundation for sizing the test set in ridge regression, a practical problem for machine learning practitioners. The contribution is incremental, since it extends known results to a specific model.

The paper derives the optimal train/test split for ridge regression in the large-data limit. The split is chosen to maximize "integrity," meaning the measured error of the trained model is as close as possible to its theoretical value. The split depends only weakly on the ridge parameter and asymptotically matches prior results for linear regression.

We derive the ideal train/test split for the ridge regression to high accuracy in the limit that the number of training rows m becomes large. The split must depend on the ridge tuning parameter, alpha, but we find that the dependence is weak and can asymptotically be ignored; all parameters vanish except for m and the number of features, n, which is held constant. This is the first time that such a split is calculated mathematically for a machine learning model in the large data limit. The goal of the calculations is to maximize "integrity," so that the measured error in the trained model is as close as possible to what it theoretically should be. This paper's result for the ridge regression split matches prior art for the plain vanilla linear regression split to the first two terms asymptotically.
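The trade-off behind the split can be explored numerically. The following is a minimal simulation sketch, not the paper's derivation: it fits ridge regression in closed form on synthetic Gaussian data and, for each candidate test-set size, averages the squared gap between the measured test error and the noise variance. That gap is only an assumed stand-in for "integrity"; the paper's exact definition is not given here. The names `ridge_fit`, `integrity_gap`, `scan_test_sizes`, and `rows` (the total number of rows before splitting) are hypothetical, as are the Gaussian data model and the default settings.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def integrity_gap(rows, n, test_size, alpha, sigma=1.0, trials=200, seed=0):
    """Mean squared gap between measured test MSE and the noise floor sigma^2.

    ASSUMPTION: this proxy is chosen for illustration only; it is not
    the paper's definition of integrity.
    """
    rng = np.random.default_rng(seed)
    gaps = np.empty(trials)
    for t in range(trials):
        # Synthetic linear model with Gaussian features and Gaussian noise.
        beta = rng.normal(size=n)
        X = rng.normal(size=(rows, n))
        y = X @ beta + sigma * rng.normal(size=rows)

        # First `test_size` rows are held out; the rest are used for training.
        X_test, y_test = X[:test_size], y[:test_size]
        X_train, y_train = X[test_size:], y[test_size:]

        beta_hat = ridge_fit(X_train, y_train, alpha)
        measured = np.mean((y_test - X_test @ beta_hat) ** 2)
        gaps[t] = (measured - sigma**2) ** 2
    return float(gaps.mean())


def scan_test_sizes(rows, n, alpha, test_sizes):
    """Evaluate the proxy gap for each candidate test-set size."""
    return {k: integrity_gap(rows, n, k, alpha) for k in test_sizes}
```

Calling, for example, `scan_test_sizes(200, 5, 1.0, [2, 10, 50, 150])` returns one gap per test size. A very small test set gives a noisy error measurement, while a very large one starves the training set, so the gap is expected to be smallest somewhere in between. Rerunning the scan with different `alpha` values is one way to probe the claim that the dependence on the ridge parameter is weak.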

