On the Sample Complexity of Causal Discovery and the Value of Domain Expertise
This work addresses the practical problem of determining data requirements for causal discovery algorithms, which is important for researchers and practitioners working with observational data.
This paper analyzes the sample complexity of causal discovery algorithms that use statistical tests for conditional independence instead of a CI oracle. It quantifies the number of data points needed for a given confidence level and demonstrates the benefits of sparsity priors and known causal directions.
Causal discovery methods seek to identify causal relations between random variables from purely observational data, as opposed to actively collected experimental data where an experimenter intervenes on a subset of correlates. One of the seminal works in this area is the Inferred Causation algorithm, which guarantees successful causal discovery under the assumption of a conditional independence (CI) oracle: an oracle that can states whether two random variables are conditionally independent given another set of random variables. Practical implementations of this algorithm incorporate statistical tests for conditional independence, in place of a CI oracle. In this paper, we analyze the sample complexity of causal discovery algorithms without a CI oracle: given a certain level of confidence, how many data points are needed for a causal discovery algorithm to identify a causal structure? Furthermore, our methods allow us to quantify the value of domain expertise in terms of data samples. Finally, we demonstrate the accuracy of these sample rates with numerical examples, and quantify the benefits of sparsity priors and known causal directions.