LG | Jun 12, 2023

A Computational Theory and Semi-Supervised Algorithm for Clustering

arXiv:2306.06974v2 | 1 citation | h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses clustering challenges for data analysts by providing a parameter-free and efficient method, though it appears incremental in its approach.

The paper tackles the problem of clustering by defining it as the dual of anomaly detection and presenting a semi-supervised algorithm that uses known relationships as seeds to guide the process, achieving advantages over popular methods on synthetic and real-world datasets.

A computational theory for clustering and a semi-supervised clustering algorithm are presented. Clustering is defined as obtaining groupings of data such that each group contains no anomalies with respect to a chosen grouping principle and measure; all other examples are considered to be fringe points, isolated anomalies, anomalous clusters or unknown clusters. More precisely, after appropriate modelling under the assumption of uniform random distribution, any example whose expectation of occurrence is less than 1 with respect to a group is considered an anomaly; otherwise it is assigned membership of that group. Thus, clustering is conceived as the dual of anomaly detection. The representation of data is taken to be the Euclidean distance of a point to a cluster median. This choice is motivated by the median's robustness to outliers, its approximate location of centrality, and the general-purpose decision boundaries it yields. The kernel of the clustering method is the perception anomaly detection algorithm, resulting in a parameter-free, fast, and efficient clustering algorithm. Acknowledging that clustering is an interactive and iterative process, the algorithm relies on a small fraction of known relationships between examples. These relationships serve as seeds that define the user's objectives and guide the clustering process. The method then expands the clusters accordingly, leaving the remaining examples for exploration and subsequent iterations. Results are presented on synthetic and real-world data sets, demonstrating advantages over the most popular unsupervised and semi-supervised clustering methods.
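
The seed-and-expand loop described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. The abstract does not give the perception algorithm's formula, so a simple median-plus-MAD threshold stands in for it here, and that stand-in introduces a parameter `k` that the paper's parameter-free method does not have. The function names (`coordinate_median`, `is_anomaly`, `expand_clusters`), the use of a coordinate-wise median, the `max_rounds` cap, and the rule of assigning a point to the nearest cluster that accepts it are all assumptions made for illustration.

```python
import math
from statistics import median


def coordinate_median(points):
    """Coordinate-wise median of a non-empty list of equal-length tuples."""
    return tuple(median(axis) for axis in zip(*points))


def is_anomaly(distance, member_distances, k=3.0):
    """Stand-in test (an assumption, NOT the perception algorithm).

    Flags a distance that lies more than k median absolute deviations
    above the median distance of the current cluster members.
    """
    centre = median(member_distances)
    mad = median(abs(d - centre) for d in member_distances)
    return distance > centre + k * mad


def expand_clusters(points, seeds, max_rounds=20):
    """Grow seeded clusters; return {point index: label} for assigned points.

    `seeds` maps a label to the indices of points known to belong together.
    Points that are anomalous for every cluster are left unassigned.
    """
    labels = {i: label for label, idxs in seeds.items() for i in idxs}
    for _ in range(max_rounds):
        # Summarise each cluster by its median and its members' distances to it.
        summary = {}
        for label in seeds:
            members = [points[i] for i, l in labels.items() if l == label]
            centre = coordinate_median(members)
            summary[label] = (centre, [math.dist(p, centre) for p in members])
        changed = False
        for i, p in enumerate(points):
            if i in labels:
                continue
            # Keep only the clusters for which this point is not an anomaly.
            candidates = [
                (math.dist(p, centre), label)
                for label, (centre, dists) in summary.items()
                if not is_anomaly(math.dist(p, centre), dists)
            ]
            if candidates:
                labels[i] = min(candidates)[1]  # nearest accepting cluster
                changed = True
        if not changed:
            break
    return labels
```

Calling `expand_clusters(points, {"A": [0, 1, 2, 3], "B": [5, 6, 7, 8]})` on a list of 2-D tuples returns labels for the seeds plus every point that some cluster accepts. Indices missing from the result are the examples left over for exploration and later iterations, which mirrors the interactive workflow the abstract describes.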
