LG | May 10

Tabular Foundation Model for Generative Modelling

Xiangjian Jiang, Mingxuan Liu, Nikola Simidjievski, Tassilo Klein, Mateja Jamnik

arXiv:2605.09424

AI Analysis

This work addresses a gap in generative tabular foundation models, which have not yet consistently matched strong dataset-specific generators in synthetic data quality. It introduces a model that exploits the causal structure of tabular data and mitigates the shift in latent embeddings between training and inference.

TabFORGE is a tabular foundation model for generative modelling that combines a pretrained causality-aware feature encoder with a two-stage design (a score-based diffusion transformer followed by a denoising-aligned decoder) to generate high-quality synthetic tabular data. Evaluated against 22 benchmark methods on 45 real-world datasets, it achieves strong structural fidelity and competitive synthetic data quality.

Generative modelling is a demanding test of foundation models, because it requires robust, holistic representation learning for a given data modality, rather than optimisation for a supervised prediction target alone. While recent work on tabular foundation models has achieved remarkable progress in predictive modelling, generative tabular foundation models remain underexplored. Existing tabular foundation generators, in particular, have not yet consistently matched strong dataset-specific generators in synthetic data quality. A key reason is their misalignment with the distinctive causal structural prior of heterogeneous tabular data. In this paper, we address this gap by introducing a novel tabular foundation model, TabFORGE, built on pretrained Tabular FOundational Representations for GEneration. TabFORGE is designed to utilise the implicitly learned causal information underlying diverse tabular datasets in a unified latent space induced by a pretrained causality-aware feature encoder. It further decouples latent modelling from decoding through a two-stage design: we first pretrain a score-based diffusion transformer, and then pretrain a denoising-aligned decoder using the denoised latent embeddings. This design elegantly mitigates the distribution shifts in latent embeddings that typically arise between training and inference. We evaluate TabFORGE comprehensively against 22 benchmark methods on 45 real-world datasets. Our results show that TabFORGE effectively learns and leverages generalisable tabular representations, enabling efficient generation of high-quality synthetic tabular data, particularly with strong structural fidelity.
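The two-stage idea in the abstract (fit a latent denoiser first, then fit the decoder on the denoised latents rather than the clean ones) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the TabFORGE implementation: the random table `X`, the fixed linear map `W_enc` standing in for the pretrained encoder, and the least-squares maps standing in for the diffusion transformer and the decoder are all hypothetical placeholders. The point is only that a decoder fitted on denoised latents matches the inputs it will see at generation time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy table: 500 rows, 6 numeric columns, one column
# depending on another to mimic a simple structural relationship.
X = rng.normal(size=(500, 6))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=500)

# Stand-in for a frozen pretrained encoder (assumption: a fixed linear map).
W_enc = rng.normal(size=(6, 4))
Z = X @ W_enc  # clean latent embeddings


def fit_linear(A, B):
    """Return the least-squares W minimising ||A @ W - B||^2."""
    W, *_ = np.linalg.lstsq(A, B, rcond=None)
    return W


def mse(A, B):
    return float(np.mean((A - B) ** 2))


# Stage 1 (stand-in for the score-based diffusion model): learn to map
# noised latents back towards clean latents.
sigma = 0.5
Z_noisy = Z + sigma * rng.normal(size=Z.shape)
W_den = fit_linear(Z_noisy, Z)
Z_denoised = Z_noisy @ W_den  # what a decoder would see at generation time

# Stage 2: fit the decoder on the denoised latents ("denoising-aligned").
W_dec_aligned = fit_linear(Z_denoised, X)

# Baseline: a decoder fitted on clean latents, then fed denoised latents.
W_dec_clean = fit_linear(Z, X)

err_aligned = mse(Z_denoised @ W_dec_aligned, X)
err_clean = mse(Z_denoised @ W_dec_clean, X)

print(f"decoder fitted on denoised latents: {err_aligned:.4f}")
print(f"decoder fitted on clean latents:    {err_clean:.4f}")
```

In this linear toy, the aligned decoder cannot do worse than the clean-latent decoder on the denoised training latents, because it is the least-squares optimum for exactly those inputs. How large the gap is in a real diffusion model and decoder is an empirical question this sketch does not answer.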
