LG · Jun 25, 2023

Evolution of $K$-means solution landscapes with the addition of dataset outliers and a robust clustering comparison measure for their analysis

arXiv:2306.14346v1 · 1 citation · h-index: 79
Originality: Incremental advance
AI Analysis

This work addresses the impact of outliers on clustering stability for data scientists, offering a novel comparison measure, though it is incremental in improving robustness.

The study investigates how adding outliers to a dataset affects the solution landscape of the $K$-means clustering algorithm, finding that the cost function surface becomes more funnelled, with longer pathways and a reduced correlation between accuracy and cost. It also proposes a robust clustering similarity measure based on the rates obtained from kinetic analysis, and demonstrates it on datasets with multiple outliers.

The $K$-means algorithm remains one of the most widely used clustering methods due to its simplicity and general utility. The performance of $K$-means depends upon locating minima low in the cost function, amongst a potentially vast number of solutions. Here, we use the energy landscape approach to map the change in $K$-means solution space as a result of increasing dataset outliers and show that the cost function surface becomes more funnelled. Kinetic analysis reveals that in all cases the overall funnel is composed of shallow locally-funnelled regions, each of which is separated by areas that do not support any clustering solutions. These shallow regions correspond to different types of clustering solution, and their increasing number with outliers leads to longer pathways within the funnel and a reduced correlation between accuracy and cost function. Finally, we propose that the rates obtained from kinetic analysis provide a novel measure of clustering similarity that incorporates information about the paths between clustering solutions. This measure is robust to outliers and we illustrate the application to datasets containing multiple outliers.
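As a rough illustration of why the $K$-means cost function has many minima, the sketch below runs Lloyd's algorithm from many random starting points and collects the distinct converged cost values, with and without added outliers. This is not the energy landscape method used in the paper; it is a simple random-restart sampling, and the helper names (`make_data`, `lloyd`, `sample_minima`), the synthetic three-blob dataset, and the uniform outlier model are all assumptions made for this example.

```python
import numpy as np

def kmeans_cost(X, centers):
    # Sum of squared distances from each point to its nearest centre.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def lloyd(X, k, rng, max_iter=100):
    # Hypothetical helper: plain Lloyd iterations from k random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centre
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, kmeans_cost(X, centers)

def sample_minima(X, k, n_starts=50, seed=0, decimals=6):
    # Distinct converged costs, used here as a crude proxy for distinct minima.
    rng = np.random.default_rng(seed)
    costs = [lloyd(X, k, rng)[1] for _ in range(n_starts)]
    return np.unique(np.round(costs, decimals))

def make_data(n_outliers, seed=0):
    # Assumed toy data: three Gaussian blobs plus uniformly scattered outliers.
    rng = np.random.default_rng(seed)
    blobs = [rng.normal(loc=c, scale=0.5, size=(50, 2))
             for c in ([0, 0], [4, 0], [2, 3])]
    outliers = rng.uniform(-10, 10, size=(n_outliers, 2))
    return np.vstack(blobs + [outliers])

if __name__ == "__main__":
    for n in (0, 5, 20):
        minima = sample_minima(make_data(n), k=3)
        print(f"outliers={n:2d}  distinct minima found={len(minima)}  "
              f"lowest cost={minima.min():.3f}")
```

Counting distinct costs says nothing about how the minima are connected; the funnel structure and pathways described above require transition-state information that this sketch does not compute.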

