LGNov 11, 2013

Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce

arXiv:1311.2334v418 citations
Originality Incremental advance
AI Analysis

This work addresses scalability issues for kernel k-means in distributed computing environments, offering a solution for handling large datasets, though it appears incremental as it builds on existing kernel methods.

The paper tackles the computational complexity and parallelization challenges of kernel k-means by defining a family of low-dimensional embeddings and proposing scalable MapReduce algorithms, demonstrating effectiveness and efficiency through empirical evaluation on benchmark datasets.

The kernel $k$-means is an effective method for data clustering which extends the commonly-used $k$-means algorithm to work on a similarity matrix over complex data structures. The kernel $k$-means algorithm is however computationally very complex as it requires the complete data matrix to be calculated and stored. Further, the kernelized nature of the kernel $k$-means algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. In this paper, we are defining a family of kernel-based low-dimensional embeddings that allows for scaling kernel $k$-means on MapReduce via an efficient and unified parallelization strategy. Afterwards, we propose two methods for low-dimensional embedding that adhere to our definition of the embedding family. Exploiting the proposed parallelization strategy, we present two scalable MapReduce algorithms for kernel $k$-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark data sets.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes