Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data
This work addresses the challenge of improving predictive accuracy on tabular data tasks, though the contribution appears incremental because it builds on the existing TabPFN method.
The paper shows that continued pre-training on real-world data significantly boosts the performance of tabular foundation models: the resulting model, Real-TabPFN, achieves substantial gains on 29 datasets from the OpenML AutoML Benchmark.
Foundation models for tabular data, like TabPFN, achieve strong performance on small datasets when pre-trained solely on synthetic data. We show that this performance can be significantly boosted by a targeted continued pre-training phase. Specifically, we demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior downstream predictive accuracy compared to using broader, potentially noisier corpora like CommonCrawl or GitTables. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.