ME | ML | Apr 4, 2019

Cross-Validation for Correlated Data

arXiv:1904.02438v3 | 64 citations
Originality: Incremental advance
AI Analysis

This paper addresses a methodological gap for researchers who apply cross-validation to correlated data, as in time series or spatial statistics. The advance is incremental, since it builds on existing CV frameworks.

The paper tackles the problem that cross-validation can be biased for correlated (non-i.i.d.) data. It introduces a bias-corrected estimator, CV_c, which improves model evaluation and selection in simulations and real data studies.

K-fold cross-validation (CV) with squared error loss is widely used for evaluating predictive models, especially when strong distributional assumptions cannot be made. However, CV with squared error loss is not free from distributional assumptions, in particular in cases involving non-i.i.d. data. This paper analyzes CV for correlated data. We present a criterion for the suitability of standard CV in the presence of correlations. When this criterion does not hold, we introduce a bias-corrected cross-validation estimator, which we term $CV_c$, that yields an unbiased estimate of prediction error in many settings where standard CV is invalid. We also demonstrate our results numerically, and find that introducing our correction substantially improves both model evaluation and model selection in simulations and real data studies.
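To make the kind of bias described in the abstract concrete, here is a minimal Python sketch. It is not the paper's $CV_c$ estimator. It is a hypothetical simulation (all names, sizes, and the data-generating model are assumptions) in which observations in the same cluster share a random intercept. Standard K-fold CV with random folds places members of the same cluster in both the training and test sets, so its squared-error estimate can look much better than the error on entirely new clusters. Cluster-grouped folds are used here only as a reference point for that target.

```python
import numpy as np

def kfold_mse(X, y, fold_ids):
    """K-fold CV estimate of squared-error loss for ordinary least squares."""
    sq_err = np.empty_like(y, dtype=float)
    for k in np.unique(fold_ids):
        test = fold_ids == k
        # Fit OLS on the training folds only.
        beta, *_ = np.linalg.lstsq(X[~test], y[~test], rcond=None)
        sq_err[test] = (y[test] - X[test] @ beta) ** 2
    return sq_err.mean()

# Hypothetical setup: 40 clusters of 10 observations, 20 cluster-level covariates.
rng = np.random.default_rng(0)
n_clusters, per_cluster, p, K = 40, 10, 20, 5
cluster = np.repeat(np.arange(n_clusters), per_cluster)

Z = rng.normal(size=(n_clusters, p))          # covariates, unrelated to y
b = rng.normal(scale=2.0, size=n_clusters)    # random intercept shared within a cluster
X = Z[cluster]
y = b[cluster] + rng.normal(size=cluster.size)

# Standard CV: observations assigned to folds at random.
random_folds = rng.permutation(cluster.size) % K
# Reference: whole clusters held out together, mimicking prediction for new clusters.
grouped_folds = cluster % K

cv_random = kfold_mse(X, y, random_folds)
cv_grouped = kfold_mse(X, y, grouped_folds)
print(f"random-fold CV error:  {cv_random:.2f}")
print(f"grouped-fold CV error: {cv_grouped:.2f}")
```

In this setup the random-fold estimate comes out smaller than the grouped-fold estimate, because the fitted coefficients absorb part of each cluster's shared intercept. Whether standard CV is appropriate depends on how the future prediction points correlate with the training data, which is the question the paper's criterion addresses.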

