LG ML | Dec 23, 2024

Fast Causal Discovery by Approximate Kernel-based Generalized Score Functions with Linear Computational Complexity

Yixin Ren, Haocheng Zhang, Yewei Xia, Hao Zhang, Jihong Guan, Shuigeng Zhou

arXiv:2412.17717v22.61 citations | h-index: 51 | KDD

Originality: Incremental advance

AI Analysis

This work addresses computational bottlenecks for researchers and practitioners using causal discovery on large datasets, though it is incremental as it builds on existing kernel-based score functions.

The paper tackles the high computational cost of kernel-based generalized score functions in causal discovery, which require O(n^3) time and O(n^2) space in the sample size n. It proposes an approximate score with O(n) time and space complexity that achieves accuracy comparable to state-of-the-art methods, especially on large datasets.

Score-based causal discovery methods can effectively identify causal relationships by evaluating candidate graphs and selecting the one with the highest score. One popular class of scores is kernel-based generalized score functions, which adapt to a wide range of scenarios and work well in practice because they circumvent assumptions about causal mechanisms and data distributions. Despite these advantages, kernel-based generalized score functions pose serious computational challenges, with a time complexity of $\mathcal{O}(n^3)$ and a memory complexity of $\mathcal{O}(n^2)$, where $n$ is the sample size. In this paper, we propose an approximate kernel-based generalized score function with $\mathcal{O}(n)$ time and space complexities. We achieve this by using a low-rank technique, designing a set of rules to handle the complex composite matrix operations required to calculate the score, and developing sampling algorithms that handle different data types efficiently. Our extensive causal discovery experiments on both synthetic and real-world data demonstrate that, compared to the state-of-the-art method, our method not only significantly reduces computational costs but also achieves comparable accuracy, especially on large datasets.
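To illustrate why a low-rank factorization can bring kernel computations down to linear cost in $n$, here is a minimal sketch. It is not the paper's algorithm: the abstract does not specify the approximation or the composite-matrix rules, so this example assumes random Fourier features for an RBF kernel and uses the standard Woodbury and Sylvester identities. All function names and parameters (`random_fourier_features`, `lowrank_logdet`, `lowrank_solve`, `num_features`, `bandwidth`, `lam`) are hypothetical. With $K \approx ZZ^\top$ and $Z \in \mathbb{R}^{n \times m}$, the log-determinant and linear solves involving $K + \lambda I$ need only $m \times m$ matrices, giving $\mathcal{O}(nm^2)$ time and $\mathcal{O}(nm)$ memory.

```python
import numpy as np


def random_fourier_features(x, num_features=50, bandwidth=1.0, seed=0):
    """Map x (n, d) to Z (n, m) so that Z @ Z.T approximates an RBF kernel matrix.

    Assumption: an RBF kernel exp(-||a - b||^2 / (2 * bandwidth^2)); the paper's
    own low-rank construction may differ.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Frequencies are drawn from the kernel's spectral density N(0, 1 / bandwidth^2).
    w = rng.normal(scale=1.0 / bandwidth, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(x @ w + b)


def lowrank_logdet(z, lam):
    """log det(lam * I_n + Z Z^T) without forming any n x n matrix.

    Uses det(lam*I_n + Z Z^T) = lam^(n - m) * det(lam*I_m + Z^T Z).
    """
    n, m = z.shape
    small = lam * np.eye(m) + z.T @ z  # m x m, built in O(n m^2)
    _, logdet_small = np.linalg.slogdet(small)
    return (n - m) * np.log(lam) + logdet_small


def lowrank_solve(z, lam, y):
    """Solve (lam * I_n + Z Z^T) v = y via the Woodbury identity."""
    m = z.shape[1]
    small = lam * np.eye(m) + z.T @ z
    return (y - z @ np.linalg.solve(small, z.T @ y)) / lam
```

For a fixed number of features $m$, both helpers scale linearly in the sample size, in contrast to the cubic cost of factorizing a dense $n \times n$ kernel matrix.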
