ML AI LG GN | Dec 18, 2020

MASSIVE: Tractable and Robust Bayesian Learning of Many-Dimensional Instrumental Variable Models

Ioan Gabriel Bucur, Tom Claassen, Tom Heskes

arXiv:2012.10141v12.72 citations

Originality: Highly original

AI Analysis

This work provides a more robust and tractable method for causal inference in high-dimensional settings. This matters for researchers working with large datasets, such as those in genomics, where traditional instrumental variable (IV) selection is often intractable.

This paper addresses the challenge of identifying valid instrumental variables (IVs) for causal inference in high-dimensional datasets, such as those from genome-wide association studies (GWAS). The authors propose a Bayesian model averaging algorithm that combines information from many candidate IVs, even when their individual validity is uncertain, to produce reliable causal effect estimates.

The recent availability of huge, many-dimensional data sets, like those arising from genome-wide association studies (GWAS), provides many opportunities for strengthening causal inference. One popular approach is to utilize these many-dimensional measurements as instrumental variables (instruments) for improving the causal effect estimate between other pairs of variables. Unfortunately, searching for proper instruments in a many-dimensional set of candidates is a daunting task due to the intractable model space and the fact that we cannot directly test which of these candidates are valid. As a result, most existing search methods either rely on overly stringent modeling assumptions or fail to capture the inherent model uncertainty in the selection process. We show that, as long as at least some of the candidates are (close to) valid, without knowing a priori which ones, they collectively still pose enough restrictions on the target interaction to obtain a reliable causal effect estimate. We propose a general and efficient causal inference algorithm that accounts for model uncertainty by performing Bayesian model averaging over the most promising many-dimensional instrumental variable models, while at the same time employing weaker assumptions regarding the data generating process. We showcase the efficiency, robustness and predictive performance of our algorithm through experimental results on both simulated and real-world data.
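
As a rough illustration of the Bayesian model averaging step mentioned in the abstract, the sketch below turns per-model log marginal likelihoods into posterior model probabilities and uses them to average per-model causal effect estimates. It is a generic textbook version, not the MASSIVE algorithm itself: the function names `bma_weights` and `bma_estimate`, the uniform prior over models, and all numbers in the usage example are assumptions made up for illustration.

```python
import math


def bma_weights(log_evidences, log_priors=None):
    """Posterior model probabilities from log marginal likelihoods.

    Assumes a uniform prior over models when log_priors is not given.
    """
    n = len(log_evidences)
    if log_priors is None:
        log_priors = [-math.log(n)] * n
    scores = [le + lp for le, lp in zip(log_evidences, log_priors)]
    # Log-sum-exp normalisation to avoid overflow/underflow.
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [math.exp(s - log_norm) for s in scores]


def bma_estimate(estimates, variances, weights):
    """Model-averaged mean and variance of a scalar effect estimate."""
    mean = sum(w * e for w, e in zip(weights, estimates))
    # Law of total variance: within-model plus between-model spread.
    var = sum(
        w * (v + (e - mean) ** 2)
        for w, e, v in zip(weights, estimates, variances)
    )
    return mean, var


if __name__ == "__main__":
    # Hypothetical values for three candidate IV models (illustrative only).
    log_evidences = [-1050.2, -1048.7, -1061.3]
    effect_estimates = [0.42, 0.47, 0.10]
    effect_variances = [0.010, 0.012, 0.050]

    weights = bma_weights(log_evidences)
    mean, var = bma_estimate(effect_estimates, effect_variances, weights)
    print("weights:", weights)
    print("averaged effect:", mean, "variance:", var)
```

The between-model term in the variance is what lets the averaged estimate reflect uncertainty about which candidate instruments are valid, rather than conditioning on a single selected model.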
