LG · Jan 7

Causal Data Augmentation for Robust Fine-Tuning of Tabular Foundation Models

Magnus Bühler, Lennart Purucker, Frank Hutter

arXiv:2601.04110v1 · 3.82 citations · h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses robust fine-tuning of tabular foundation models in low-data regimes. It offers a principled approach that improves generalization and makes validation more reliable, though the contribution is incremental: it applies causal methods to a specific domain.

The paper tackles fine-tuning of tabular foundation models under data scarcity by proposing CausalMixFT, a method that uses structural causal models to generate synthetic samples for data augmentation. Across 33 datasets, it improves median normalized ROC-AUC from 0.10 to 0.12 and reduces the median validation-test performance correlation gap from 0.67 to 0.30.

Fine-tuning tabular foundation models (TFMs) under data scarcity is challenging, as early stopping on even scarcer validation data often fails to capture true generalization performance. We propose CausalMixFT, a method that enhances fine-tuning robustness and downstream performance by generating structurally consistent synthetic samples using Structural Causal Models (SCMs) fitted on the target dataset. This approach augments limited real data with causally informed synthetic examples, preserving feature dependencies while expanding training diversity. Evaluated across 33 classification datasets from TabArena and over 2300 fine-tuning runs, our CausalMixFT method consistently improves median normalized ROC-AUC from 0.10 (standard fine-tuning) to 0.12, outperforming purely statistical generators such as CTGAN (-0.01), TabEBM (-0.04), and TableAugment (-0.09). Moreover, it narrows the median validation-test performance correlation gap from 0.67 to 0.30, enabling more reliable validation-based early stopping, a key step toward improving fine-tuning stability under data scarcity. These results demonstrate that incorporating causal structure into data augmentation provides an effective and principled route to fine-tuning tabular foundation models in low-data regimes.
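
The augmentation idea in the abstract (fit an SCM on the target data, then draw synthetic rows that respect feature dependencies) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the SCM is fitted, so the code below assumes a linear additive-noise model over continuous features with a causal graph that is already known. The names `fit_linear_scm`, `sample_scm`, and `parents` are hypothetical, and label handling is left out.

```python
from graphlib import TopologicalSorter

import numpy as np


def fit_linear_scm(X, parents):
    """Fit x_j = w . x_parents + b + noise for every column j.

    ASSUMPTION: `parents` (column index -> list of parent column indices)
    describes a known DAG. Discovering that graph is out of scope here.
    """
    scm = {}
    n = X.shape[0]
    for j in range(X.shape[1]):
        pa = list(parents.get(j, []))
        # Design matrix: parent columns plus an intercept column.
        A = np.column_stack([X[:, pa], np.ones(n)])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        residuals = X[:, j] - A @ coef
        scm[j] = (pa, coef, residuals)
    return scm


def sample_scm(scm, n_samples, rng):
    """Ancestral sampling: draw each node after its parents."""
    order = TopologicalSorter({j: scm[j][0] for j in scm}).static_order()
    out = np.empty((n_samples, len(scm)))
    for j in order:
        pa, coef, residuals = scm[j]
        # Noise is bootstrapped from the empirical residuals of node j.
        noise = rng.choice(residuals, size=n_samples, replace=True)
        A = np.column_stack([out[:, pa], np.ones(n_samples)])
        out[:, j] = A @ coef + noise
    return out


# Toy data with the chain x0 -> x1 -> x2 (purely illustrative).
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = 2.0 * x0 + 0.1 * rng.normal(size=200)
x2 = -1.0 * x1 + 0.5 * rng.normal(size=200)
X = np.column_stack([x0, x1, x2])

scm = fit_linear_scm(X, parents={1: [0], 2: [1]})
X_syn = sample_scm(scm, n_samples=500, rng=rng)

# Mix real and synthetic rows into one augmented training set.
X_aug = np.vstack([X, X_syn])
```

Because every synthetic column is generated from its parents, dependencies such as the strong x0 to x1 relationship carry over to the synthetic rows, which is the property the abstract contrasts with purely statistical generators. For classification, the label could be modeled as one more node in the graph; that step is omitted here.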
