A Bayesian Perspective on Training Speed and Model Selection
This work addresses model selection and generalization in machine learning, particularly for neural networks. Its contribution is incremental: it builds on existing Bayesian perspectives and applies them to new contexts.
The authors connect training speed to model selection by showing that a measure of training speed can estimate the marginal likelihood in linear models and, under certain conditions, predict the relative weighting of models in linear model combinations. They verify these results for linear models and for the infinite-width limit of deep neural networks. They also provide empirical evidence that the same intuition holds for deep neural networks trained with SGD, suggesting a new direction for explaining why such networks are biased towards functions that generalize well.
We take a Bayesian perspective to illustrate a connection between training speed and the marginal likelihood in linear models. This provides two major insights: first, a measure of a model's training speed can be used to estimate its marginal likelihood; second, this measure, under certain conditions, predicts the relative weighting of models in linear model combinations trained to minimize a regression loss. We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks. We further provide encouraging empirical evidence that the intuition developed in these settings also holds for deep neural networks trained with stochastic gradient descent. Our results suggest a promising new direction towards explaining why neural networks trained with stochastic gradient descent are biased towards functions that generalize well.
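The text above does not spell out which measure of training speed is used. One standard identity that makes such a connection plausible is the chain rule of probability: the log marginal likelihood equals the sum of log posterior predictive densities as data points are seen one at a time, so a model that predicts each new point well after seeing few points accumulates a higher evidence. The sketch below checks this identity numerically for Bayesian linear regression. The function names, the zero-mean isotropic Gaussian prior, the known noise variance, and the default values `prior_var` and `noise_var` are all assumptions made for illustration, not details taken from the source.

```python
import numpy as np


def log_evidence_closed_form(X, y, prior_var=1.0, noise_var=0.1):
    """Log marginal likelihood log N(y | 0, prior_var * X X^T + noise_var * I).

    Assumes weights w ~ N(0, prior_var * I) and y = X w + Gaussian noise.
    """
    n = len(y)
    K = prior_var * X @ X.T + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    quad = y @ np.linalg.solve(K, y)
    return -0.5 * (quad + logdet + n * np.log(2 * np.pi))


def log_evidence_sequential(X, y, prior_var=1.0, noise_var=0.1):
    """Sum of log posterior predictive densities, one data point at a time."""
    d = X.shape[1]
    mean = np.zeros(d)            # current posterior mean of the weights
    cov = prior_var * np.eye(d)   # current posterior covariance of the weights
    total = 0.0
    for x_i, y_i in zip(X, y):
        # Predictive distribution of y_i given all earlier points.
        pred_mean = x_i @ mean
        pred_var = x_i @ cov @ x_i + noise_var
        resid = y_i - pred_mean
        total += -0.5 * (np.log(2 * np.pi * pred_var) + resid**2 / pred_var)
        # Rank-one Bayesian update of the weight posterior with (x_i, y_i).
        gain = cov @ x_i / pred_var
        mean = mean + gain * resid
        cov = cov - np.outer(gain, x_i @ cov)
    return total


# Synthetic data, generated only to exercise the two functions.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ rng.normal(size=3) + np.sqrt(0.1) * rng.normal(size=20)

closed = log_evidence_closed_form(X, y)
sequential = log_evidence_sequential(X, y)
print(closed, sequential)
```

The two printed values should agree up to floating-point error. The sequential form is the one that reads like a training curve: each term scores how well the model predicts the next point given the points it has already seen.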