ML LG ST MEApr 19, 2022

Approximating Persistent Homology for Large Datasets

arXiv:2204.09155v310.814 citationsh-index: 12Has Code

Originality Incremental advance

AI Analysis

This work addresses a bottleneck in topological data analysis for researchers dealing with massive datasets, offering a practical solution with theoretical guarantees.

The paper tackles the computational infeasibility of persistent homology for large datasets by proposing a multiple subsampling framework, deriving finite sample convergence rates for empirical means, and validating the approach with experiments on synthetic and real-world data, including Poincaré embeddings of lexical databases.

Persistent homology is an important methodology in topological data analysis which adapts theory from algebraic topology to data settings. Computing persistent homology produces persistence diagrams, which have been successfully used in diverse domains. Despite its widespread use, persistent homology is simply impossible to compute when a dataset is very large. We study a statistical approach to the problem of computing persistent homology for massive datasets using a multiple subsampling framework and extend it to three summaries of persistent homology: Hölder continuous vectorizations of persistence diagrams; the alternative representation as persistence measures; and standard persistence diagrams. Specifically, we derive finite sample convergence rates for empirical means for persistent homology and practical guidance on interpreting and tuning parameters. We validate our approach through extensive experiments on both synthetic and real-world data. We demonstrate the performance of multiple subsampling in a permutation test to analyze the topological structure of Poincaré embeddings of large lexical databases.

View on arXiv PDF Code

Similar