ML COOct 17, 2016

The xyz algorithm for fast interaction search in high-dimensional data

Gian-Andrea Thanei, Nicolai Meinshausen, Rajen D. Shah

arXiv:1610.05108v46.324 citationsHas Code

Originality Incremental advance

AI Analysis

This addresses the computational bottleneck in high-dimensional data analysis, such as genome-wide association studies, by enabling faster interaction search, though it is incremental as it builds on existing methods for closest-pair problems.

The paper tackles the problem of efficiently discovering interactions in high-dimensional regression by introducing the xyz algorithm, which reduces computational complexity from at least O(p^2) to subquadratic time, screening over 10^11 interactions in under 280 seconds on a single-core CPU.

When performing regression on a dataset with $p$ variables, it is often of interest to go beyond using main linear effects and include interactions as products between individual variables. For small-scale problems, these interactions can be computed explicitly but this leads to a computational complexity of at least $\mathcal{O}(p^2)$ if done naively. This cost can be prohibitive if $p$ is very large. We introduce a new randomised algorithm that is able to discover interactions with high probability and under mild conditions has a runtime that is subquadratic in $p$. We show that strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires $\mathcal{O}(p^α)$ operations for $1 < α< 2$ depending on their strength. The underlying idea is to transform interaction search into a closestpair problem which can be solved efficiently in subquadratic time. The algorithm is called $\mathit{xyz}$ and is implemented in the language R. We demonstrate its efficiency for application to genome-wide association studies, where more than $10^{11}$ interactions can be screened in under $280$ seconds with a single-core $1.2$ GHz CPU.

View on arXiv PDF Code

Similar