LG · Apr 24

TabSCM: A Practical Framework for Generating Realistic Tabular Data

arXiv:2604.22337 · 46.3 · h-index: 38
AI Analysis

For practitioners needing causally sound synthetic tabular data (e.g., healthcare, finance), TabSCM offers a practical, interpretable, and fast alternative to existing generative models.

TabSCM generates realistic tabular data that preserves causal dependencies. It matches or surpasses GAN, diffusion, and LLM baselines in statistical fidelity, downstream utility, and privacy risk, while reducing rule violations and enabling counterfactual queries. It runs up to 583x faster than diffusion-only models.

Most tabular-data generators match marginal statistics yet ignore causal structure, leading downstream models to learn spurious or unfair patterns. We present TabSCM, a mixed-type generator that preserves causal dependencies. Starting from a Completed Partially Directed Acyclic Graph (CPDAG) found by any causal structure discovery algorithm, TabSCM (i) orients edges to a DAG, (ii) fits root-node marginals with KDE or categorical frequencies, and (iii) learns topologically ordered structural assignments. These assignments use conditional diffusion models for continuous child nodes and gradient-boosted trees for categorical ones. Ancestral sampling yields semantically valid records and enables exact counterfactual queries. On seven public datasets spanning healthcare, finance, housing, and environment, TabSCM matches or surpasses state-of-the-art GAN, diffusion, and LLM baselines in statistical fidelity, downstream utility, and privacy risk, while also cutting rule-violation rates and providing causally meaningful and robust conditional interventions. Because generation is decomposed into explicit equations, it runs up to 583x faster than diffusion-only models and exposes interpretable knobs for fairness auditing and policy simulation, making TabSCM a practical choice for realism, explainability, and causal soundness.
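To make the sampling step concrete, here is a minimal sketch of ancestral sampling over an already-oriented DAG, using only the Python standard library. It is not the authors' implementation. The variable names, toy observations, and coefficients are all hypothetical, the edge-orientation step is skipped, and the child mechanisms are simple hand-written stand-ins (a linear function with noise, a threshold rule) in place of the conditional diffusion models and gradient-boosted trees named in the abstract.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical toy DAG, written as node -> set of parents (not from the paper).
PARENTS = {
    "age": set(),
    "smoker": set(),
    "blood_pressure": {"age", "smoker"},
    "risk": {"blood_pressure"},
}

# Toy "training" observations for the two root nodes (made up for illustration).
AGE_OBS = [34.0, 41.0, 47.0, 52.0, 58.0, 63.0]
SMOKER_OBS = ["no", "no", "no", "yes", "no", "yes"]


def sample_kde(observations, bandwidth, rng):
    """Draw from a Gaussian KDE: pick a data point, then add Gaussian jitter."""
    return rng.choice(observations) + rng.gauss(0.0, bandwidth)


def sample_frequency(observations, rng):
    """Draw a category in proportion to its empirical frequency."""
    return rng.choice(observations)


def blood_pressure_mechanism(parents, rng):
    # Stand-in for a learned conditional model of a continuous child node.
    smoker_shift = 8.0 if parents["smoker"] == "yes" else 0.0
    return 90.0 + 0.6 * parents["age"] + smoker_shift + rng.gauss(0.0, 5.0)


def risk_mechanism(parents, rng):
    # Stand-in for a learned classifier of a categorical child node.
    p_high = 0.8 if parents["blood_pressure"] > 130.0 else 0.1
    return "high" if rng.random() < p_high else "low"


MECHANISMS = {
    "age": lambda parents, rng: sample_kde(AGE_OBS, 2.0, rng),
    "smoker": lambda parents, rng: sample_frequency(SMOKER_OBS, rng),
    "blood_pressure": blood_pressure_mechanism,
    "risk": risk_mechanism,
}


def ancestral_sample(rng, interventions=None):
    """Sample one record in topological order; do(X=x) overrides X's mechanism."""
    interventions = interventions or {}
    record = {}
    # static_order() yields each node only after all of its parents.
    for node in TopologicalSorter(PARENTS).static_order():
        if node in interventions:
            record[node] = interventions[node]
        else:
            parent_values = {p: record[p] for p in PARENTS[node]}
            record[node] = MECHANISMS[node](parent_values, rng)
    return record


def mean_bp(records):
    return sum(r["blood_pressure"] for r in records) / len(records)


rng = random.Random(0)
observational = [ancestral_sample(rng) for _ in range(2000)]
interventional = [ancestral_sample(rng, {"smoker": "yes"}) for _ in range(2000)]
print(round(mean_bp(observational), 1), round(mean_bp(interventional), 1))
```

Because every node is generated from its parents by an explicit function, fixing a node's value and re-running the downstream mechanisms gives an interventional sample. A full counterfactual query would additionally require recovering each record's noise terms before re-running the mechanisms, which this sketch does not attempt.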

