Revisiting Generalization Measures Beyond IID: An Empirical Study under Distributional Shift
This work addresses the challenge of predicting model performance under distributional shift, a practical concern for deep learning practitioners, but it is incremental in that it builds on prior large-scale studies.
The study benchmarks the robustness of generalization measures beyond the IID regime by training models over 10,000 hyperparameter configurations and evaluating more than 40 measures, finding that distribution shifts substantially alter predictive performance for many measures while a smaller subset remains stable.
Generalization remains a central yet unresolved challenge in deep learning, particularly the ability to predict a model's performance beyond its training distribution using quantities available prior to test-time evaluation. Building on the large-scale study of Jiang et al. (2020) and the concerns raised by Dziugaite et al. (2020) about instability across training configurations, we benchmark the robustness of generalization measures beyond the IID regime. We train small-to-medium models over 10,000 hyperparameter configurations and evaluate more than 40 measures computable from the trained model and the available training data alone. We significantly broaden the experimental scope along multiple axes: (i) extending the evaluation beyond the standard IID setting to include benchmarking for robustness across diverse distribution shifts, (ii) evaluating multiple architectures and training recipes, and (iii) newly incorporating calibration- and information-criteria-based measures to assess their alignment with both IID and OOD generalization. We find that distribution shifts can substantially alter the predictive performance of many generalization measures, while a smaller subset remains comparatively stable across settings.
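To make "predictive performance of a generalization measure" concrete, the sketch below scores one measure by its rank correlation with the generalization gap across trained configurations, once under IID evaluation and once under a shifted (OOD) evaluation. Kendall's rank correlation is the criterion used in Jiang et al. (2020); whether this study uses the same statistic is an assumption here. The function name `kendall_tau` and all numeric values are hypothetical and chosen only for illustration; they are not results from the study.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a rank correlation between two equal-length sequences.

    Counts concordant minus discordant pairs and divides by the total
    number of pairs. Tied pairs contribute zero.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: one value per trained hyperparameter configuration.
measure = [0.8, 1.5, 2.1, 3.0, 3.6]        # value of some generalization measure
iid_gap = [0.02, 0.05, 0.04, 0.09, 0.11]   # train/test gap on IID test data
ood_gap = [0.20, 0.12, 0.25, 0.15, 0.18]   # gap on a shifted test set

tau_iid = kendall_tau(measure, iid_gap)
tau_ood = kendall_tau(measure, ood_gap)
print(f"tau (IID) = {tau_iid:.2f}, tau (OOD) = {tau_ood:.2f}")
```

In this toy example the measure ranks configurations well under IID evaluation but carries no ranking signal under the shifted evaluation, which is the kind of degradation the study sets out to quantify. A measure that is stable in the sense of the abstract would show similar correlations in both settings.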