MLLGOct 31, 2015

Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means

arXiv:1511.00152v361 citationsHas Code
Originality Incremental advance
AI Analysis

This work addresses the challenge of big data scalability for machine learning practitioners, offering an incremental improvement over existing sampling methods.

The paper tackles the problem of processing large datasets efficiently by proposing a compression scheme that randomly sparsifies data components, enabling faster PCA and K-means computations, especially in distributed or streaming settings, with demonstrated benefits on test datasets.

We analyze a compression scheme for large data sets that randomly keeps a small percentage of the components of each data sample. The benefit is that the output is a sparse matrix and therefore subsequent processing, such as PCA or K-means, is significantly faster, especially in a distributed-data setting. Furthermore, the sampling is single-pass and applicable to streaming data. The sampling mechanism is a variant of previous methods proposed in the literature combined with a randomized preconditioning to smooth the data. We provide guarantees for PCA in terms of the covariance matrix, and guarantees for K-means in terms of the error in the center estimators at a given step. We present numerical evidence to show both that our bounds are nearly tight and that our algorithms provide a real benefit when applied to standard test data sets, as well as providing certain benefits over related sampling approaches.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes