Detecting non-causal artifacts in multivariate linear regression models
The work addresses spurious correlations in statistical inference and is aimed at researchers in causal discovery and machine learning. It represents an incremental improvement in methodology.
The paper tackles the problem of distinguishing causal associations from non-causal artifacts, such as overfitting or confounding, in multivariate linear regression models. It proposes a method based on the orientation of the vector of regression coefficients relative to the covariance matrix of the predictors, and shows that such artifacts make this vector concentrate in the subspace spanned by eigenvectors with low eigenvalues.
We consider linear models where $d$ potential causes $X_1,\dots,X_d$ are correlated with one target quantity $Y$ and propose a method to infer whether the association is causal or whether it is an artifact caused by overfitting or hidden common causes. We employ the idea that in the former case the vector of regression coefficients has 'generic' orientation relative to the covariance matrix $\Sigma_{XX}$ of $X$. Using an ICA-based model for confounding, we show that both confounding and overfitting yield regression vectors that concentrate mainly in the space of low eigenvalues of $\Sigma_{XX}$.
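
The concentration effect can be illustrated numerically for the overfitting case. The following is a minimal sketch, not the estimator from the paper: it expands the fitted regression vector in the eigenbasis of the sample covariance of $X$ and compares the eigenvalue average weighted by the vector's squared coordinates with the plain average of the eigenvalues. The function names (`spectral_weights`, `orientation_ratio`, `simulate`), the ratio statistic itself, and the simulation settings are assumptions made for illustration. A ratio near 1 is what a 'generic' orientation would give; a ratio well below 1 indicates concentration on low eigenvalues.

```python
import numpy as np


def spectral_weights(X, Y):
    """Eigenvalues of the sample covariance of X, and the squared coordinates
    of the normalized regression vector in the corresponding eigenbasis."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    n = X.shape[0]
    cov_xx = Xc.T @ Xc / (n - 1)
    cov_xy = Xc.T @ Yc / (n - 1)
    # Ordinary least squares: a_hat = cov_xx^{-1} cov_xy
    a_hat = np.linalg.solve(cov_xx, cov_xy)
    eigvals, eigvecs = np.linalg.eigh(cov_xx)
    coords = eigvecs.T @ a_hat
    weights = coords**2 / np.sum(coords**2)
    return eigvals, weights


def orientation_ratio(X, Y):
    """Weighted mean eigenvalue divided by the plain mean eigenvalue
    (an illustrative statistic, assumed here, not taken from the paper)."""
    eigvals, weights = spectral_weights(X, Y)
    return float(weights @ eigvals / eigvals.mean())


def simulate(n, d, causal, seed):
    """Toy data: X has unequal variances per coordinate. If causal, Y depends
    on X through a random unit vector; otherwise Y is independent noise."""
    rng = np.random.default_rng(seed)
    scales = np.linspace(0.3, 3.0, d)
    X = rng.standard_normal((n, d)) * scales
    noise = rng.standard_normal(n)
    if causal:
        a = rng.standard_normal(d)
        a /= np.linalg.norm(a)
        Y = X @ a + 0.1 * noise
    else:
        Y = noise
    return X, Y


if __name__ == "__main__":
    # Few samples, Y unrelated to X: any fitted association is overfitting.
    print("overfitting:", orientation_ratio(*simulate(60, 30, False, 0)))
    # Many samples, genuinely causal linear model with a random direction.
    print("causal:     ", orientation_ratio(*simulate(5000, 30, True, 0)))
```

In the overfitting setting the coordinates of the fitted vector are inflated along directions where $X$ has little variance, which is why the weighted average drops below the plain one. The confounded case treated in the paper would need an explicit confounder model and is not covered by this sketch.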