ML LG Feb 7, 2022

Optimal Ratio for Data Splitting

arXiv:2202.03326v117.2663 citations

Originality: Incremental advance

AI Analysis

The result gives researchers and practitioners in machine learning and statistics a theoretical guideline for allocating data between training and testing sets. The advance is incremental because it builds on existing splitting practices.

The paper addresses how to choose the ratio for splitting a dataset into training and testing sets when fitting a statistical or machine learning model. It shows that the optimal training-to-testing ratio is √p:1, where p is the number of parameters in a linear regression model that explains the data well.

It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article we show that the optimal splitting ratio is $\sqrt{p}:1$, where $p$ is the number of parameters in a linear regression model that explains the data well.
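
As a rough illustration of the arithmetic, a ratio of $\sqrt{p}:1$ corresponds to a training fraction of $\sqrt{p}/(\sqrt{p}+1)$. The sketch below turns that into set sizes for a dataset of `n` rows. The function name `split_sizes`, the rounding rule, and the clamp that keeps both sets non-empty are assumptions made for this example and are not taken from the paper.

```python
import math


def split_sizes(n: int, p: int) -> tuple[int, int]:
    """Return (n_train, n_test) for a sqrt(p):1 training-to-testing ratio.

    n is the number of rows; p is the number of parameters in a linear
    regression model assumed to explain the data well. The rounding and
    clamping choices are illustrative assumptions.
    """
    if n < 2:
        raise ValueError("need at least two rows to form both sets")
    if p < 1:
        raise ValueError("p must be a positive parameter count")

    root_p = math.sqrt(p)
    # A ratio of sqrt(p):1 means training gets sqrt(p) / (sqrt(p) + 1) of the data.
    train_fraction = root_p / (root_p + 1.0)

    # Round to the nearest row, then keep at least one row in each set.
    n_train = round(n * train_fraction)
    n_train = min(max(n_train, 1), n - 1)
    return n_train, n - n_train


# With p = 9 the ratio is 3:1, so 1000 rows split into 750 and 250.
print(split_sizes(1000, 9))  # (750, 250)
```

Under this rule, a larger parameter count moves more data into training: `p = 1` gives an even split, while `p = 16` gives a 4:1 split.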
