Demystifying amortized causal discovery with transformers
This work clarifies theoretical limitations for researchers in causal inference, indicating that supervised methods are incremental and still bound by classical assumptions.
The paper analyzes CSIvA, a transformer-based method for amortized causal discovery, showing that its performance depends on having a good prior on test data and identifiability, and it fails to generalize to unseen causal models, challenging the idea that supervised learning can bypass identifiability restrictions.
Supervised learning for causal discovery from observational data often achieves competitive performance despite seemingly avoiding the explicit assumptions that traditional methods require for identifiability. In this work, we analyze CSIvA (Ke et al., 2023) on bivariate causal models, a transformer architecture for amortized inference promising to train on synthetic data and transfer to real ones. First, we bridge the gap with identifiability theory, showing that the training distribution implicitly defines a prior on the causal model of the test observations: consistent with classical approaches, good performance is achieved when we have a good prior on the test data, and the underlying model is identifiable. Second, we find that CSIvA can not generalize to classes of causal models unseen during training: to overcome this limitation, we theoretically and empirically analyze \textit{when} training CSIvA on datasets generated by multiple identifiable causal models with different structural assumptions improves its generalization at test time. Overall, we find that amortized causal discovery with transformers still adheres to identifiability theory, violating the previous hypothesis from Lopez-Paz et al. (2015) that supervised learning methods could overcome its restrictions.