LG | Jul 14, 2021

Disparity Between Batches as a Signal for Early Stopping

arXiv:2107.06665v1 | 12 citations
Originality: Incremental advance
AI Analysis

The paper offers a practical early-stopping criterion for machine learning practitioners working with limited data or noisy labels. The advance is incremental, since it builds on existing gradient-based methods.

The authors address early stopping in deep neural networks by proposing gradient disparity, a metric based on the $\ell_2$ distance between the gradients of two mini-batches. They show empirically that it signals overfitting, particularly when data is limited or labels are noisy, and that it is strongly related to the generalization error.

We propose a metric for evaluating the generalization ability of deep neural networks trained with mini-batch gradient descent. Our metric, called gradient disparity, is the $\ell_2$ norm distance between the gradient vectors of two mini-batches drawn from the training set. It is derived from a probabilistic upper bound on the difference between the classification errors over a given mini-batch, when the network is trained on this mini-batch and when the network is trained on another mini-batch of points sampled from the same dataset. We empirically show that gradient disparity is a very promising early-stopping criterion (i) when data is limited, as it uses all the samples for training and (ii) when available data has noisy labels, as it signals overfitting better than the validation data. Furthermore, we show in a wide range of experimental settings that gradient disparity is strongly related to the generalization error between the training and test sets, and that it is also very informative about the level of label noise.
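To make the definition concrete, here is a minimal sketch of gradient disparity as the $\ell_2$ distance between the gradients of two mini-batches. It is not the authors' implementation: the linear model with squared loss, the helper names `batch_gradient`, `gradient_disparity`, and `should_stop`, and the patience-based stopping rule are all assumptions chosen for illustration. The abstract does not specify how the signal is turned into a stopping decision.

```python
import math


def batch_gradient(weights, batch):
    """Gradient of the mean squared error of a linear model over one mini-batch.

    Assumption: a toy linear model y ~ w.x stands in for the deep network.
    `batch` is a list of (features, target) pairs.
    """
    grad = [0.0] * len(weights)
    for x, y in batch:
        pred = sum(w * xi for w, xi in zip(weights, x))
        err = pred - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / len(batch)
    return grad


def gradient_disparity(weights, batch_a, batch_b):
    """l2 distance between the gradient vectors of two mini-batches."""
    g_a = batch_gradient(weights, batch_a)
    g_b = batch_gradient(weights, batch_b)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g_a, g_b)))


def should_stop(history, patience=3):
    """Hypothetical rule: stop once the disparity rose `patience` epochs in a row."""
    if len(history) <= patience:
        return False
    recent = history[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    w = [0.0, 0.0]
    batch_a = [([1.0, 0.0], 1.0)]
    batch_b = [([0.0, 1.0], 1.0)]
    print(gradient_disparity(w, batch_a, batch_b))
    print(should_stop([0.5, 0.6, 0.8, 1.1]))
```

In practice one would average the disparity over several mini-batch pairs per epoch and track that average. Because both batches come from the training set, no samples need to be held out for validation, which is the property the abstract highlights for the limited-data setting.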

Code Implementations: 1 repo
