Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce
This work addresses scalability issues for kernel k-means in distributed computing environments, offering a solution for handling large datasets, though it appears incremental as it builds on existing kernel methods.
The paper tackles the computational complexity and parallelization challenges of kernel k-means by defining a family of low-dimensional embeddings and proposing scalable MapReduce algorithms, demonstrating effectiveness and efficiency through empirical evaluation on benchmark datasets.
The kernel $k$-means is an effective method for data clustering which extends the commonly-used $k$-means algorithm to work on a similarity matrix over complex data structures. The kernel $k$-means algorithm is however computationally very complex as it requires the complete data matrix to be calculated and stored. Further, the kernelized nature of the kernel $k$-means algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. In this paper, we are defining a family of kernel-based low-dimensional embeddings that allows for scaling kernel $k$-means on MapReduce via an efficient and unified parallelization strategy. Afterwards, we propose two methods for low-dimensional embedding that adhere to our definition of the embedding family. Exploiting the proposed parallelization strategy, we present two scalable MapReduce algorithms for kernel $k$-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark data sets.