LG · May 29

TabCausal: Pretraining Across Causal Environments for Tabular Causal Discovery

arXiv:2605.31156 · 70.4
Predicted impact: top 25% in LG · last 90 days · Originality: Incremental advance
AI Analysis

This work aims to improve the transferability and performance of amortized causal discovery for researchers and practitioners working with tabular data, especially in scenarios involving interventional evidence. It represents an incremental advancement in the field of Causal Discovery Foundation Models.

This paper introduces TabCausal, a data-driven Causal Discovery Foundation Model (CDFM) designed to recover directed causal relations from tabular data. It addresses limitations in existing CDFMs by employing a broad causal pretraining strategy across diverse graph priors, structural mechanisms, noise models, dimensions, sample sizes, and intervention regimes. TabCausal achieves better macro-averaged performance than diverse causal discovery baselines on large-scale synthetic benchmarks and demonstrates robust structure recovery, particularly with interventional evidence, across both synthetic and semantic environments.

Causal discovery aims to recover directed causal relations from observational and interventional data, providing a basis for mechanistic understanding and reliable decision-making. Causal discovery foundation models (CDFMs) seek to amortize this problem by mapping a dataset directly to a causal graph in a single forward pass, avoiding per-dataset testing, search, or optimization. However, existing CDFMs remain limited, often failing to consistently match strong classical methods, and we find that a key bottleneck is how causal pretraining tasks are constructed. Based on this observation, we propose TabCausal, a data-driven CDFM trained with broad causal pretraining over diverse graph priors, structural mechanisms, noise models, dimensions, sample sizes, and intervention regimes. A dynamic task construction strategy composes these causal environments into varied discovery tasks, enabling more transferable structural learning from observational and mixed-interventional data. On large-scale synthetic benchmarks, TabCausal achieves better macro-averaged performance than a diverse set of causal discovery baselines. To further bridge abstract synthetic generators and realistic causal reasoning scenarios, we introduce a protocol-guided and LLM-audited semantic causal environment benchmark, where domain-grounded SCMs generate interpretable observational and interventional datasets for out-of-distribution analysis. Across both synthetic and semantic environments, TabCausal demonstrates robust structure recovery, especially under interventional evidence, highlighting broad causal pretraining as a key ingredient for transferable amortized causal discovery.
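To make the idea of composing causal environments into discovery tasks concrete, here is a minimal sketch of how one synthetic pretraining task could be sampled. It is an illustration under stated assumptions, not TabCausal's actual generator: the function names `sample_dag` and `sample_task`, the Erdős-Rényi graph prior, the linear/tanh mechanisms, the Gaussian/Laplace noise choices, and the hard intervention value are all hypothetical stand-ins for the broader families the abstract describes.

```python
import numpy as np

def sample_dag(d, edge_prob, rng):
    """Sample an Erdos-Renyi DAG; adj[i, j] = 1 means i -> j (assumed prior)."""
    upper = np.triu(rng.random((d, d)) < edge_prob, k=1)
    order = rng.permutation(d)          # random topological order
    adj = np.zeros((d, d), dtype=int)
    adj[np.ix_(order, order)] = upper   # edges only go forward in `order`
    return adj, order

def sample_task(d=5, n=200, edge_prob=0.3, intervene=False, seed=0):
    """Return (data, true graph, intervention target) for one task."""
    rng = np.random.default_rng(seed)
    adj, order = sample_dag(d, edge_prob, rng)
    # Random edge weights with random signs, zero where there is no edge.
    weights = adj * rng.uniform(0.5, 2.0, (d, d)) * rng.choice([-1.0, 1.0], (d, d))
    nonlinear = rng.random() < 0.5                      # mechanism family
    noise_kind = rng.choice(["gaussian", "laplace"])    # noise model
    target = int(rng.integers(d)) if intervene else None

    X = np.zeros((n, d))
    for j in order:                     # parents are always filled before children
        if j == target:
            X[:, j] = 2.0               # hard intervention do(X_j = 2), hypothetical value
            continue
        noise = rng.normal(size=n) if noise_kind == "gaussian" else rng.laplace(size=n)
        parent_signal = X @ weights[:, j]
        X[:, j] = (np.tanh(parent_signal) if nonlinear else parent_signal) + noise
    return X, adj, target
```

An amortized model would be trained on many such `(X, adj)` pairs, varying `d`, `n`, the graph prior, the mechanism, the noise model, and the intervention regime from task to task, so that at test time a single forward pass maps a new dataset to a predicted graph.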

