LG, Nov 12, 2025

Generalization Can Emerge in Tabular Foundation Models From a Single Table

Junwei Ma, Nour Shaheen, Alex Labach, Amine Mhedhbi, Frank Hutter, Anthony L. Caterini, Valentin Thomas

arXiv:2511.09665v17.11 citations, h-index: 16

Originality: Incremental advance

AI Analysis

This work addresses the problem of data efficiency for researchers and practitioners in tabular deep learning, offering an incremental improvement by reducing pre-training data needs.

The paper challenges the view that broad generalization in tabular foundation models requires large pre-training datasets, showing that self-supervised pre-training on just a single real table can achieve strong transfer across heterogeneous benchmarks.

Deep tabular modelling increasingly relies on in-context learning where, during inference, a model receives a set of $(x,y)$ pairs as context and predicts labels for new inputs without weight updates. We challenge the prevailing view that broad generalization here requires pre-training on large synthetic corpora (e.g., TabPFN priors) or a large collection of real data (e.g., TabDPT training datasets), discovering that a relatively small amount of data suffices for generalization. We find that simple self-supervised pre-training on just a single real table can produce surprisingly strong transfer across heterogeneous benchmarks. By systematically pre-training and evaluating on many diverse datasets, we analyze what aspects of the data are most important for building a Tabular Foundation Model (TFM) generalizing across domains. We then connect this to the pre-training procedure shared by most TFMs and show that the number and quality of tasks one can construct from a dataset is key to downstream performance.
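To make the notion of constructing tasks from one table concrete, here is a minimal sketch of how a single table could yield many in-context learning tasks. It assumes one common self-supervised recipe: pick a random column as a pseudo-label, pick a few other columns as features, and split a random set of rows into context and query. The abstract does not specify the paper's procedure, so the function names `sample_task` and `count_column_tasks`, all parameters, and the toy table are hypothetical.

```python
import math
import random


def sample_task(table, rng, n_context=4, n_query=2, n_features=2):
    """Sample one hypothetical in-context task from a single table.

    `table` is a list of equal-length rows. One randomly chosen column acts
    as the pseudo-label; `n_features` other columns act as inputs.
    """
    n_rows, n_cols = len(table), len(table[0])

    # Choose distinct columns: the first is the target, the rest are features.
    cols = rng.sample(range(n_cols), n_features + 1)
    target_col, feature_cols = cols[0], cols[1:]

    # Choose distinct rows, then split them into context and query sets.
    rows = rng.sample(range(n_rows), n_context + n_query)
    ctx_rows, qry_rows = rows[:n_context], rows[n_context:]

    def split(row_ids):
        X = [[table[r][c] for c in feature_cols] for r in row_ids]
        y = [table[r][target_col] for r in row_ids]
        return X, y

    X_ctx, y_ctx = split(ctx_rows)
    X_qry, y_qry = split(qry_rows)
    return X_ctx, y_ctx, X_qry, y_qry


def count_column_tasks(n_cols, n_features):
    """Count distinct (target column, feature subset) choices.

    This ignores row sampling, which multiplies the count further.
    """
    return n_cols * math.comb(n_cols - 1, n_features)


# Toy table (assumed for illustration): 20 rows, 5 numeric columns.
rng = random.Random(0)
table = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(20)]

X_ctx, y_ctx, X_qry, y_qry = sample_task(table, rng)
n_column_tasks = count_column_tasks(n_cols=5, n_features=2)
```

Under this assumed recipe, the number of column-level tasks grows combinatorially with the number of columns, which is one way to read the abstract's claim that the number of tasks a dataset supports matters for downstream performance. A model would be trained to predict `y_qry` from `X_qry` given the `(X_ctx, y_ctx)` pairs as context.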

