LG · Sep 20, 2022

Sanity Check for External Clustering Validation Benchmarks using Internal Validation Measures

Hyeon Jeon, Michael Aupetit, DongHwa Shin, Aeri Cho, Seokhyeon Park, Jinwook Seo

arXiv:2209.10042v17.86 citations · h-index: 27 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a fundamental issue in clustering validation for researchers and practitioners, though it is incremental as it builds on existing internal validation measures.

The paper tackles the unreliability of benchmarking clustering techniques on labeled datasets. It targets the flawed assumption that class labels correspond to well-separated clusters, proposes a method to evaluate this assumption across datasets using a generalized Calinski-Harabasz index, and demonstrates the method's accuracy and necessity in experiments.

We address the lack of reliability in benchmarking clustering techniques based on labeled datasets. A standard scheme in external clustering validation is to use class labels as ground truth clusters, based on the assumption that each class forms a single, clearly separated cluster. However, because this cluster-label matching (CLM) assumption often breaks, the absence of a sanity check for the CLM of benchmark datasets casts doubt on the validity of external validations. Still, evaluating the degree of CLM is challenging. For example, internal clustering validation measures can quantify CLM within a single dataset by comparing its different clusterings, but they are not designed to compare clusterings across different datasets. In this work, we propose a principled way to generate between-dataset internal measures that enable the comparison of CLM across datasets. We first determine four axioms for between-dataset internal measures, complementing Ackerman and Ben-David's within-dataset axioms. We then propose processes to generalize internal measures to fulfill these new axioms, and use them to extend the widely used Calinski-Harabasz index for between-dataset CLM evaluation. Through quantitative experiments, we (1) verify the validity and necessity of the generalization processes and (2) show that the proposed between-dataset Calinski-Harabasz index accurately evaluates CLM across datasets. Finally, we demonstrate the importance of evaluating the CLM of benchmark datasets before conducting external validation.
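As a rough illustration of the starting point the abstract describes, the sketch below computes the standard (within-dataset) Calinski-Harabasz index while treating class labels as clusters. It is not the between-dataset generalization proposed in the paper, whose details the abstract does not give. The function name `calinski_harabasz` and the toy points are assumptions made for this example only. Both labelings are applied to the same points, so comparing their scores stays within one dataset, which is the setting the standard index is designed for.

```python
from collections import defaultdict


def calinski_harabasz(points, labels):
    """Standard Calinski-Harabasz index, using `labels` as the clustering.

    Hypothetical helper for illustration; this is the classic within-dataset
    index, not the paper's between-dataset variant.
    """
    n = len(points)
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    k = len(groups)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 classes and fewer classes than points")

    def mean(rows):
        # Coordinate-wise mean of a list of equal-length points.
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sqdist(a, b):
        # Squared Euclidean distance between two points.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = 0.0  # size-weighted dispersion of class centroids
    within = 0.0   # dispersion of points around their own class centroid
    for rows in groups.values():
        centroid = mean(rows)
        between += len(rows) * sqdist(centroid, overall)
        within += sum(sqdist(p, centroid) for p in rows)

    if within == 0.0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


# Toy data (assumed for illustration): two tight groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
matching_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # each class is one separated cluster
mixed_labels = [0, 1, 0, 1, 0, 1, 0, 1]     # each class straddles both clusters

score_matching = calinski_harabasz(points, matching_labels)
score_mixed = calinski_harabasz(points, mixed_labels)
print(score_matching, score_mixed)
```

The labeling whose classes coincide with the two groups scores far higher than the labeling that splits each group across classes, which is the sense in which an internal measure can quantify CLM. The raw score also depends on the number of points and classes, which is one reason it is not directly comparable between different datasets.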

