LGDMMLMar 14, 2022

Geometric reconstructions of density based clusterings

arXiv:2203.08020v1h-index: 2
Originality Incremental advance
AI Analysis

This addresses scalability issues for density-based clustering in real-world applications, such as geospatial data analysis, though it appears incremental as it builds on existing algorithms.

The paper tackles the problem of clustering very large datasets with DBSCAN* and HDBSCAN*, which is infeasible with standard implementations, by proving that clusters can be constructed from specific subsets using Euclidean geometry, enabling clustering of large datasets like the Microsoft Building Footprint Database of the US.

DBSCAN* and HDBSCAN* are well established density based clustering algorithms. However, obtaining the clusters of very large datasets is infeasible, limiting their use in real world applications. By exploiting the geometry of Euclidean space, we prove that it is possible to systematically construct the DBSCAN* and HDBSCAN* clusters of a finite $X\subset \mathbb{R}^n$ from specific subsets of $X$. We are able to control the size of these subsets and therefore our results make it possible to cluster very large datasets. To illustrate our theory, we cluster the Microsoft Building Footprint Database of the US, which is not possible using the standard implementations.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes