ML · LG · Feb 7, 2022

Optimal Ratio for Data Splitting

arXiv:2202.03326v1 · 663 citations
AI Analysis

The result gives researchers and practitioners in machine learning and statistics a theoretical guideline for allocating data between training and testing. The contribution is incremental, since it builds on existing splitting practices.

The paper addresses how to choose the split ratio between training and testing sets for statistical or machine learning models. It shows that the optimal training-to-testing ratio is √p:1, where p is the number of parameters in a linear regression model that explains the data well.

It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article we show that the optimal splitting ratio is $\sqrt{p}:1$, where $p$ is the number of parameters in a linear regression model that explains the data well.
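
As a rough illustration of the arithmetic, a training-to-testing ratio of $\sqrt{p}:1$ corresponds to a testing fraction of $1/(\sqrt{p}+1)$. The sketch below turns that into integer set sizes. The function name `optimal_split_sizes`, the rounding rule, and the requirement that each set keep at least one sample are assumptions made for this example, not part of the paper.

```python
import math
from typing import Tuple


def optimal_split_sizes(n_samples: int, n_params: int) -> Tuple[int, int]:
    """Return (n_train, n_test) for a sqrt(p):1 training-to-testing ratio.

    Hypothetical helper: rounding and the minimum-size clamp are
    illustrative choices, not taken from the paper.
    """
    if n_samples < 2:
        raise ValueError("need at least 2 samples to form two sets")
    if n_params < 1:
        raise ValueError("n_params must be at least 1")

    # Ratio sqrt(p):1 means the testing share is 1 / (sqrt(p) + 1).
    test_fraction = 1.0 / (math.sqrt(n_params) + 1.0)

    # Round to the nearest integer, then keep both sets non-empty.
    n_test = round(n_samples * test_fraction)
    n_test = min(max(n_test, 1), n_samples - 1)

    return n_samples - n_test, n_test


if __name__ == "__main__":
    # With p = 9 the ratio is 3:1, so a quarter of the data is held out.
    print(optimal_split_sizes(1000, 9))
```

Under this rule the testing share shrinks as $p$ grows: $p = 1$ gives an even split, while $p = 9$ holds out one quarter of the data.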
