CV · Apr 2

Beyond the Fold: Quantifying Split-Level Noise and the Case for Leave-One-Dataset-Out AU Evaluation

Saurabh Hinduja, Gurmeet Kaur, Maneesh Bilalpur, Jeffrey Cohn, Shaun Canavan

arXiv:2604.02162

AI Analysis

This work addresses evaluation reliability for researchers in facial expression analysis, highlighting that reported improvements may fall within protocol noise.

The paper examines stochastic variance in cross-validation for facial Action Unit (AU) detection. On BP4D+, repeated 3-fold subject-exclusive splits produce an empirical noise floor of ±0.065 in average F1, with larger variation for low-prevalence AUs. The authors also evaluate a Leave-One-Dataset-Out (LODO) protocol across five AU datasets and report that it yields more stable results.

Subject-exclusive cross-validation is the standard evaluation protocol for facial Action Unit (AU) detection, yet reported improvements are often small. We show that cross-validation itself introduces measurable stochastic variance. On BP4D+, repeated 3-fold subject-exclusive splits produce an empirical noise floor of $\pm 0.065$ in average F1, with substantially larger variation for low-prevalence AUs. Operating-point metrics such as F1 fluctuate more than threshold-independent measures such as AUC, and model ranking can change under different fold assignments. We further evaluate cross-dataset robustness using a Leave-One-Dataset-Out (LODO) protocol across five AU datasets. LODO removes partition randomness and exposes domain-level instability that is not visible under single-dataset cross-validation. Together, these results suggest that gains often reported in cross-fold validation may fall within protocol variance. Leave-one-dataset-out cross-validation yields more stable and interpretable findings.
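The two protocols contrasted in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `evaluate` callback, the function names, and the dataset labels are hypothetical, and measuring the noise floor as half the range of the per-repeat average F1 is one plausible reading, since the abstract does not give the exact definition.

```python
import random
from statistics import mean


def subject_exclusive_folds(subject_ids, k=3, seed=0):
    """Partition the unique subjects into k disjoint folds.

    No subject appears in more than one fold, which is what
    "subject-exclusive" means here. The seed controls the assignment.
    """
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    return [set(subjects[i::k]) for i in range(k)]


def repeated_cv_scores(subject_ids, evaluate, k=3, repeats=10):
    """Return one fold-averaged score per random fold assignment.

    `evaluate(train_subjects, test_subjects)` is a hypothetical callback
    that trains on the first set and returns F1 on the second.
    """
    scores = []
    for seed in range(repeats):
        folds = subject_exclusive_folds(subject_ids, k, seed)
        fold_scores = []
        for i, test in enumerate(folds):
            train = set().union(*(f for j, f in enumerate(folds) if j != i))
            fold_scores.append(evaluate(train, test))
        scores.append(mean(fold_scores))
    return scores


def half_range(scores):
    """Assumed noise-floor measure: half the spread across repeats."""
    return (max(scores) - min(scores)) / 2


def lodo_splits(datasets):
    """Leave-One-Dataset-Out: each dataset is held out exactly once.

    The splits are fully determined by the dataset list, so there is
    no partition randomness to average over.
    """
    for held_out in datasets:
        yield [d for d in datasets if d != held_out], held_out
```

With a real training routine passed as `evaluate`, `half_range(repeated_cv_scores(...))` gives a spread against which a claimed F1 improvement can be compared, while `lodo_splits` enumerates a fixed set of train/test pairings that does not depend on a random seed.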
