Assessing Generalization of SGD via Disagreement
This work addresses the problem of predicting generalization in deep learning. It improves on earlier methods by reducing data requirements: instead of a fresh training set, it needs only a second SGD run on the same training data and unlabeled test inputs.
The authors estimate the test error of deep networks by showing that the disagreement rate between two networks of the same architecture, trained on the same data with different SGD runs, predicts test error. This builds on prior work that required the second network to be trained on a fresh training set. They demonstrate the result empirically and explain it theoretically through the calibration of ensembles of SGD-trained models, which yields a simple tool for assessing generalization.
We empirically show that the test error of deep networks can be estimated by simply training the same architecture on the same training set but with a different run of Stochastic Gradient Descent (SGD), and measuring the disagreement rate between the two networks on unlabeled test data. This builds on -- and is a stronger version of -- the observation in Nakkiran & Bansal '20, which requires the second run to be on an altogether fresh training set. We further theoretically show that this peculiar phenomenon arises from the \emph{well-calibrated} nature of \emph{ensembles} of SGD-trained models. This finding not only provides a simple empirical measure to directly predict the test error using unlabeled test data, but also establishes a new conceptual connection between generalization and calibration.
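The disagreement measurement described above can be sketched in a few lines. This is a minimal illustration and not the authors' code: it assumes each network's predictions on the test inputs are already available as arrays of class indices, and the names `disagreement_rate` and `error_rate` are hypothetical helpers introduced here. Only `disagreement_rate` is label-free; `error_rate` is included for checking the estimate when labels happen to be available.

```python
import numpy as np


def disagreement_rate(preds_a, preds_b):
    """Fraction of test inputs on which two models predict different classes.

    preds_a, preds_b: 1-D sequences of predicted class indices, one entry per
    test input, produced by two networks trained on the same data with
    different SGD runs. No test labels are needed.
    """
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    if preds_a.shape != preds_b.shape:
        raise ValueError("prediction arrays must have the same shape")
    if preds_a.size == 0:
        raise ValueError("prediction arrays must be non-empty")
    return float(np.mean(preds_a != preds_b))


def error_rate(preds, labels):
    """Fraction of test inputs misclassified (requires labels)."""
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    if preds.shape != labels.shape:
        raise ValueError("predictions and labels must have the same shape")
    if preds.size == 0:
        raise ValueError("predictions must be non-empty")
    return float(np.mean(preds != labels))


# Toy predictions (made-up values, for illustration only).
run_1 = [0, 1, 2, 1, 0, 2, 1, 0]
run_2 = [0, 1, 1, 1, 0, 2, 0, 0]

# The two runs differ on 2 of 8 inputs.
estimate = disagreement_rate(run_1, run_2)
print(estimate)  # 0.25
```

In practice the two prediction arrays would come from taking the argmax of each trained network's outputs on the same unlabeled test set; the abstract's claim is that the resulting rate tracks the test error of either network.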