LGMar 23, 2015

Communication Efficient Distributed Kernel Principal Component Analysis

arXiv:1503.06858v414 citations
Originality Highly original
AI Analysis

This addresses the challenge of distributed nonlinear feature extraction for large-scale data applications, offering a provably efficient solution.

The paper tackles the problem of performing kernel PCA on large distributed datasets without costly data centralization, developing a communication-efficient algorithm that computes global kernel principal components with relative error guarantees, achieving communication cost of $ ilde{O}(s ho k/\varepsilon + s k^2/\varepsilon^3)$ words for $k$ components over $s$ workers.

Kernel Principal Component Analysis (KPCA) is a key machine learning algorithm for extracting nonlinear features from data. In the presence of a large volume of high dimensional data collected in a distributed fashion, it becomes very costly to communicate all of this data to a single data center and then perform kernel PCA. Can we perform kernel PCA on the entire dataset in a distributed and communication efficient fashion while maintaining provable and strong guarantees in solution quality? In this paper, we give an affirmative answer to the question by developing a communication efficient algorithm to perform kernel PCA in the distributed setting. The algorithm is a clever combination of subspace embedding and adaptive sampling techniques, and we show that the algorithm can take as input an arbitrary configuration of distributed datasets, and compute a set of global kernel principal components with relative error guarantees independent of the dimension of the feature space or the total number of data points. In particular, computing $k$ principal components with relative error $ε$ over $s$ workers has communication cost $\tilde{O}(s ρk/ε+s k^2/ε^3)$ words, where $ρ$ is the average number of nonzero entries in each data point. Furthermore, we experimented the algorithm with large-scale real world datasets and showed that the algorithm produces a high quality kernel PCA solution while using significantly less communication than alternative approaches.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes