LGCVFeb 2, 2013

Parallel D2-Clustering: Large-Scale Clustering of Discrete Distributions

arXiv:1302.0435v21 citations
AI Analysis

This work addresses scalability issues for researchers and practitioners applying D2-clustering to large datasets in domains like computer vision and bioinformatics, though it is incremental as it builds on an existing method.

The paper tackles the high computational complexity of D2-clustering for discrete distributions, presenting a parallel algorithm that achieves significant speed-up on large-scale image, video, and protein data with minor accuracy loss.

The discrete distribution clustering algorithm, namely D2-clustering, has demonstrated its usefulness in image classification and annotation where each object is represented by a bag of weighed vectors. The high computational complexity of the algorithm, however, limits its applications to large-scale problems. We present a parallel D2-clustering algorithm with substantially improved scalability. A hierarchical structure for parallel computing is devised to achieve a balance between the individual-node computation and the integration process of the algorithm. Additionally, it is shown that even with a single CPU, the hierarchical structure results in significant speed-up. Experiments on real-world large-scale image data, Youtube video data, and protein sequence data demonstrate the efficiency and wide applicability of the parallel D2-clustering algorithm. The loss in clustering accuracy is minor in comparison with the original sequential algorithm.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes