LGCVFeb 24, 2024

Scalable Density-based Clustering with Random Projections

arXiv:2402.15679v22 citationsh-index: 1
Originality Highly original
AI Analysis

This addresses the computational bottleneck for researchers and practitioners applying density-based clustering to large, high-dimensional data, offering a scalable solution with extensions to various distances.

The paper tackles the scalability problem of density-based clustering in high dimensions by introducing sDBSCAN, which uses random projections to speed up core point identification and achieves similar clustering to DBSCAN with high probability; empirically, it runs in minutes on million-point datasets, compared to hours or memory failures for existing methods.

We present sDBSCAN, a scalable density-based clustering algorithm in high dimensions with cosine distance. Utilizing the neighborhood-preserving property of random projections, sDBSCAN can quickly identify core points and their neighborhoods, the primary hurdle of density-based clustering. Theoretically, sDBSCAN outputs a clustering structure similar to DBSCAN under mild conditions with high probability. To further facilitate sDBSCAN, we present sOPTICS, a scalable OPTICS for interactive exploration of the intrinsic clustering structure. We also extend sDBSCAN and sOPTICS to L2, L1, $χ^2$, and Jensen-Shannon distances via random kernel features. Empirically, sDBSCAN is significantly faster and provides higher accuracy than many other clustering algorithms on real-world million-point data sets. On these data sets, sDBSCAN and sOPTICS run in a few minutes, while the scikit-learn's counterparts demand several hours or cannot run due to memory constraints.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes