Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees
This provides an incremental improvement for data scientists needing robust clustering tools, offering an alternative to methods like HDBSCAN.
The paper tackles the problem of robust clustering for datasets with varying cluster characteristics by introducing Lumbermark, a divisive algorithm that uses mutual reachability minimum spanning trees to reduce noise influence and produce partitions with user-specified sizes, showing good performance on benchmark data.
We introduce Lumbermark, a robust divisive clustering algorithm capable of detecting clusters of varying sizes, densities, and shapes. Lumbermark iteratively chops off large limbs connected by protruding segments of a dataset's mutual reachability minimum spanning tree. The use of mutual reachability distances smoothens the data distribution and decreases the influence of low-density objects, such as noise points between clusters or outliers at their peripheries. The algorithm can be viewed as an alternative to HDBSCAN that produces partitions with user-specified sizes. A fast, easy-to-use implementation of the new method is available in the open-source 'lumbermark' package for Python and R. We show that Lumbermark performs well on benchmark data and hope it will prove useful to data scientists and practitioners across different fields.