LG AI ML Oct 31, 2023

Efficient Bayesian Learning Curve Extrapolation using Prior-Data Fitted Networks

Steven Adriaensen, Herilalaina Rakotoarison, Samuel Müller, Frank Hutter

arXiv:2310.20447v124.047 citations h-index: 11

Originality: Incremental advance

AI Analysis

The paper provides a computationally efficient Bayesian method for predicting training outcomes. It is useful for machine learning researchers and practitioners who need to reduce training time and resource use, though it builds incrementally on prior-data fitted networks.

The paper tackles learning curve extrapolation by proposing LC-PFN, a prior-data fitted network that approximates the Bayesian posterior predictive distribution of model performance in later training epochs. It reports a speed-up of over 10,000x compared to MCMC while maintaining competitive accuracy on 20,000 real learning curves across diverse datasets and models, and a simple early stopping criterion built on it yields 2-6x speed-ups in model selection on 45 of the 53 datasets.
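As a rough orientation, the quantity being approximated can be written out as follows. The notation is an assumption introduced here, not taken from the paper: $y_{1:T'}$ is the observed part of the curve, $y_t$ a later point, $\varphi$ the parameters of a curve drawn from the prior, and $q_\theta$ the network. The integral form also assumes observations are conditionally independent given $\varphi$.

```latex
% Posterior predictive distribution for a later epoch t, given the first T' epochs
p(y_t \mid y_{1:T'}) = \int p(y_t \mid \varphi)\, p(\varphi \mid y_{1:T'})\, d\varphi,
\qquad T' < t \le T

% A PFN aims to output an approximation of this distribution in one forward pass
q_\theta(y_t \mid t, y_{1:T'}) \approx p(y_t \mid y_{1:T'})
```

MCMC estimates the integral by sampling $\varphi$ from the posterior for each curve, whereas a network of this kind amortizes that cost across curves.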

Learning curve extrapolation aims to predict model performance in later epochs of training, based on the performance in earlier epochs. In this work, we argue that, while the inherent uncertainty in the extrapolation of learning curves warrants a Bayesian approach, existing methods are (i) overly restrictive, and/or (ii) computationally expensive. We describe the first application of prior-data fitted neural networks (PFNs) in this context. A PFN is a transformer, pre-trained on data generated from a prior, to perform approximate Bayesian inference in a single forward pass. We propose LC-PFN, a PFN trained to extrapolate 10 million artificial right-censored learning curves generated from a parametric prior proposed in prior art using MCMC. We demonstrate that LC-PFN can approximate the posterior predictive distribution more accurately than MCMC, while being over 10 000 times faster. We also show that the same LC-PFN achieves competitive performance extrapolating a total of 20 000 real learning curves from four learning curve benchmarks (LCBench, NAS-Bench-201, Taskset, and PD1) that stem from training a wide range of model architectures (MLPs, CNNs, RNNs, and Transformers) on 53 different datasets with varying input modalities (tabular, image, text, and protein data). Finally, we investigate its potential in the context of model selection and find that a simple LC-PFN based predictive early stopping criterion obtains 2 - 6x speed-ups on 45 of these datasets, at virtually no overhead.
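The predictive early stopping idea mentioned at the end of the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's criterion or the LC-PFN API: `should_stop`, `toy_posterior_samples`, and the 0.05 threshold are hypothetical names and values, and the toy sampler is a stand-in for any model that returns samples of final performance.

```python
import numpy as np


def should_stop(final_samples, best_so_far, threshold=0.05):
    """Hypothetical rule: stop a run if the estimated probability that its
    final performance beats the best run seen so far is below `threshold`."""
    final_samples = np.asarray(final_samples, dtype=float)
    p_improve = float(np.mean(final_samples > best_so_far))
    return p_improve < threshold


def toy_posterior_samples(observed, n_samples=1000, seed=0):
    """Stand-in for a learned extrapolator (NOT LC-PFN): samples a final
    accuracy as the last observed value plus a random non-negative gain
    proportional to the most recent improvement."""
    rng = np.random.default_rng(seed)
    observed = np.asarray(observed, dtype=float)
    recent_gain = max(observed[-1] - observed[-2], 0.0)
    extra = recent_gain * rng.exponential(scale=5.0, size=n_samples)
    return np.clip(observed[-1] + extra, 0.0, 1.0)


if __name__ == "__main__":
    partial_curve = [0.50, 0.60, 0.65]  # validation accuracy per epoch
    samples = toy_posterior_samples(partial_curve)
    print("stop early:", should_stop(samples, best_so_far=0.90))
```

The decision depends only on samples from a predictive distribution, so a faster predictor lowers the cost of checking the criterion at every epoch.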
