ML, AI, LG | Oct 11, 2024

SOAK: Same/Other/All K-fold cross-validation for estimating similarity of patterns in data subsets

arXiv:2410.08643v1 | h-index: 19
Originality: Incremental advance
AI Analysis

This method addresses the practical issue of data subset similarity for machine learning practitioners, but it is incremental as it builds on existing cross-validation techniques.

The authors tackled the problem of estimating similarity between data subsets to determine if training on one subset yields accurate predictions on another, proposing SOAK, a cross-validation method that compares models trained on different subsets. They demonstrated SOAK on 20 datasets, including real-world and benchmark data, to assess subset similarity and prediction accuracy.

In many real-world applications of machine learning, we want to know whether we can train on the data gathered so far and obtain accurate predictions on a new test data subset that is qualitatively different in some respect (time period, geographic region, etc.). Another question is whether data subsets are similar enough that combining them during model training is beneficial. We propose SOAK, Same/Other/All K-fold cross-validation, a new method that can be used to answer both questions. SOAK systematically compares models that are trained on different subsets of data and then used for prediction on a fixed test subset, in order to estimate the similarity of learnable/predictable patterns in data subsets. We show results of using SOAK on 6 new real data sets (with geographic/temporal subsets, to check whether predictions are accurate on new subsets), 3 image pair data sets (subsets are different image types, to check that we get smaller prediction error on similar images), and 11 benchmark data sets with predefined train/test splits (to check similarity of predefined splits).
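The Same/Other/All comparison described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`fit_line`, `soak`, `toy_rows`), the one-feature least-squares learner, the noise-free toy data, and the choice to draw "other" training rows only from the training folds are all assumptions made for illustration. For each test subset and fold, the held-out fold of that subset is the fixed test set, and three models are trained on the same subset, the other subsets, or all subsets.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x (hypothetical toy learner)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def soak(rows, n_folds=3, seed=0):
    """rows: list of (subset, x, y) tuples.

    Returns {(test_subset, train_kind): mean squared test error over folds},
    where train_kind is "same", "other", or "all".
    """
    rng = random.Random(seed)
    # Group row indices by subset, then assign fold ids within each subset.
    by_subset = {}
    for i, (subset, _, _) in enumerate(rows):
        by_subset.setdefault(subset, []).append(i)
    fold = {}
    for indices in by_subset.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        for position, i in enumerate(shuffled):
            fold[i] = position % n_folds

    results = {}
    for test_subset in by_subset:
        for kind in ("same", "other", "all"):
            fold_errors = []
            for k in range(n_folds):
                # Fixed test set: fold k of the test subset.
                test = [i for i in by_subset[test_subset] if fold[i] == k]
                # Training set: rows outside fold k, filtered by train kind
                # (assumption: "other" and "all" also exclude fold k).
                train = [
                    i for i, (subset, _, _) in enumerate(rows)
                    if fold[i] != k and (
                        kind == "all"
                        or (kind == "same" and subset == test_subset)
                        or (kind == "other" and subset != test_subset)
                    )
                ]
                predict = fit_line([rows[i][1] for i in train],
                                   [rows[i][2] for i in train])
                fold_errors.append(
                    mean((predict(rows[i][1]) - rows[i][2]) ** 2 for i in test)
                )
            results[(test_subset, kind)] = mean(fold_errors)
    return results


def toy_rows(slope_a, slope_b, n=12):
    """Two subsets, A and B, each following y = slope * x without noise."""
    rows = [("A", x, slope_a * x) for x in range(n)]
    rows += [("B", x, slope_b * x) for x in range(n)]
    return rows


res_similar = soak(toy_rows(2.0, 2.0))     # A and B share one pattern
res_different = soak(toy_rows(2.0, -2.0))  # A and B have opposite patterns

for name, res in (("similar", res_similar), ("different", res_different)):
    for (test_subset, kind), err in sorted(res.items()):
        print(f"{name:9s} test={test_subset} train={kind:5s} mse={err:.3f}")
```

In this toy setup, "other" and "all" errors that stay close to the "same" error suggest the subsets share a learnable pattern, while a much larger "other" error suggests they do not. A real analysis would substitute an actual learning algorithm and data, and would compare the error distributions across folds rather than only their means.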

Code Implementations: 1 repo