Fast and explainable clustering in the Manhattan and Tanimoto distance
This work provides a faster and more explainable clustering method for domains such as cheminformatics, although the contribution is incremental: it adapts an existing algorithm to new distance metrics.
The authors extend the CLASSIX clustering algorithm to support the Manhattan and Tanimoto distances, sorting the data by a vector norm and using the triangle inequality to terminate the neighbor search early. On a chemical fingerprint benchmark, CLASSIX Tanimoto is about 30 times faster than the Taylor–Butina algorithm and about 80 times faster than DBSCAN, while producing higher-quality clusters.
The CLASSIX algorithm is a fast and explainable approach to data clustering. In its original form, this algorithm exploits the sorting of the data points by their first principal component to truncate the search for nearby data points, with nearness being defined in terms of the Euclidean distance. Here we extend CLASSIX to other distance metrics, including the Manhattan distance and the Tanimoto distance. Instead of principal components, we use an appropriate norm of the data vectors as the sorting criterion, combined with the triangle inequality for search termination. In the case of the Tanimoto distance, a provably sharper intersection inequality is used to further boost the performance of the new algorithm. On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto is about 30 times faster than the Taylor–Butina algorithm, and about 80 times faster than DBSCAN, while computing higher-quality clusters in both cases.
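The sorting-plus-early-termination idea described above can be illustrated with a minimal Python sketch. It covers only the radius-neighbor search, not the full CLASSIX aggregation and merging stages, and all function names are hypothetical. For the Manhattan distance it uses the reverse triangle inequality, d(x, y) >= | ||x||_1 - ||y||_1 |. For the Tanimoto distance on binary fingerprints (stored here as Python integers used as bitsets) it uses the standard bit-count bound: if |x| <= |y|, the similarity is at most |x| / |y|. This bound is an assumption chosen for illustration and is not necessarily the exact intersection inequality used by the authors.

```python
def manhattan(x, y):
    """Manhattan (1-norm) distance between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def manhattan_pairs(points, radius):
    """All index pairs (i, j), i < j, with Manhattan distance <= radius."""
    norms = [sum(abs(a) for a in p) for p in points]  # 1-norm of each point
    order = sorted(range(len(points)), key=norms.__getitem__)
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            # Reverse triangle inequality: d(x, y) >= | ||x||_1 - ||y||_1 |.
            # Norms are non-decreasing along `order`, so once the gap
            # exceeds the radius no later point can be a neighbor.
            if norms[j] - norms[i] > radius:
                break
            if manhattan(points[i], points[j]) <= radius:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def popcount(bits):
    """Number of set bits in a non-negative integer."""
    return bin(bits).count("1")


def tanimoto_distance(x, y):
    """Tanimoto distance 1 - |x & y| / |x | y| between two bitsets."""
    union = popcount(x | y)
    return 0.0 if union == 0 else 1.0 - popcount(x & y) / union


def tanimoto_pairs(fingerprints, radius):
    """All index pairs (i, j), i < j, with Tanimoto distance <= radius."""
    counts = [popcount(f) for f in fingerprints]
    order = sorted(range(len(fingerprints)), key=counts.__getitem__)
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            # With |x| <= |y|: |x & y| <= |x| and |x | y| >= |y|, hence
            # similarity <= |x| / |y| and distance >= 1 - |x| / |y|.
            # Bit counts are non-decreasing along `order`, so we can stop.
            if counts[i] < (1.0 - radius) * counts[j]:
                break
            if tanimoto_distance(fingerprints[i], fingerprints[j]) <= radius:
                pairs.add((min(i, j), max(i, j)))
    return pairs
```

Both functions return the same pairs as an exhaustive all-pairs comparison. The sort simply lets the inner loop stop as soon as the cheap norm-based lower bound on the distance exceeds the radius, which is where the speed-up in this kind of search comes from.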