IR, DB, LG | Feb 6, 2023

Learned Accelerator Framework for Angular-Distance-Based High-Dimensional DBSCAN

arXiv:2302.03136v1 | 1 citation | h-index: 24
AI Analysis

This work addresses efficiency and quality issues in density-based clustering for high-dimensional neural embeddings, representing an incremental improvement over existing methods.

The paper tackles the degraded performance of DBSCAN on high-dimensional data by proposing LAF, a learned accelerator framework that speeds up DBSCAN and its variants using angular distance, achieving state-of-the-art efficiency and quality improvements.

Density-based clustering is a commonly used tool in data science, and much data science work today relies on high-dimensional neural embeddings. However, traditional density-based clustering techniques such as DBSCAN perform poorly on high-dimensional data. In this paper, we propose LAF, a generic learned accelerator framework that speeds up the original DBSCAN and the sampling-based variants of DBSCAN on high-dimensional data under the angular distance metric. The framework consists of a learned cardinality estimator and a post-processing module. The cardinality estimator quickly predicts whether a data point is a core point, so that unnecessary range queries can be skipped, while the post-processing module detects false negative predictions and merges falsely separated clusters. The evaluation shows that our LAF-enhanced DBSCAN method outperforms the state-of-the-art efficient DBSCAN variants in both efficiency and quality.
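The gating idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes cosine distance on unit-normalised vectors as the angular metric, uses a brute-force range query, and takes the estimator as a plain callable. The names `gated_dbscan`, `predict_count` and `alpha` are hypothetical, and the post-processing module that repairs false negatives is left out.

```python
import numpy as np

def range_query(X, i, eps):
    """Indices j whose cosine distance 1 - <x_i, x_j> is at most eps.
    Assumes the rows of X are unit-normalised."""
    return np.flatnonzero(1.0 - X @ X[i] <= eps)

def gated_dbscan(X, eps, min_pts, predict_count, alpha=1.0):
    """DBSCAN that skips the range query for points predicted to be non-core.

    predict_count(i) is a hypothetical stand-in for a learned cardinality
    estimator: it returns an estimated neighbour count for point i.
    Returns (labels, number_of_range_queries); label -1 means noise.
    """
    n = len(X)
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    queries = 0
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if predict_count(i) < alpha * min_pts:
            continue  # predicted non-core: no range query is issued
        neigh = range_query(X, i, eps)
        queries += 1
        if len(neigh) < min_pts:
            continue  # the estimate was too high; point is not core
        labels[i] = cluster
        stack = list(neigh)
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster  # core or border point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            if predict_count(j) < alpha * min_pts:
                continue  # predicted non-core: do not expand from j
            nj = range_query(X, j, eps)
            queries += 1
            if len(nj) >= min_pts:
                stack.extend(nj)
        cluster += 1
    return labels, queries
```

Passing an estimator that always returns a large value reduces this to ordinary DBSCAN with one range query per point, which gives a baseline for counting how many queries a real estimator saves. An estimator that underestimates will wrongly skip core points and may split clusters, which is the failure the abstract's post-processing module is meant to detect and repair.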

Code Implementations: 1 repo