MLITLGSPNov 21, 2024

Exponentially Consistent Nonparametric Linkage-Based Clustering of Data Sequences

arXiv:2411.13922v43 citationsh-index: 17IEEE Transactions on Signal Processing
Originality Incremental advance
AI Analysis

This work addresses clustering problems in statistics and machine learning by improving theoretical guarantees for linkage-based methods, though it appears incremental as it relaxes assumptions rather than introducing a new paradigm.

The paper tackles nonparametric clustering of data sequences from unknown distributions, showing that single linkage-based (SLINK) clustering achieves exponential consistency under a less strict assumption than prior methods, and proposes a sequential algorithm (SLINK-SEQ) that reduces sample requirements in simulations.

In this paper, we consider nonparametric clustering of $M$ independent and identically distributed (i.i.d.) data sequences generated from {\em unknown} distributions. The distributions of the $M$ data sequences belong to $K$ underlying distribution clusters. Existing results on exponentially consistent nonparametric clustering algorithms, like single linkage-based (SLINK) clustering and $k$-medoids distribution clustering, assume that the maximum intra-cluster distance ($d_L$) is smaller than the minimum inter-cluster distance ($d_H$). First, in the fixed sample size (FSS) setting, we show that exponential consistency can be achieved for SLINK clustering under a less strict assumption, $d_I < d_H$, where $d_I$ is the maximum distance between any two sub-clusters of a cluster that partition the cluster. Note that $d_I < d_L$ in general. Thus, our results show that SLINK is exponentially consistent for a larger class of problems than previously known. In our simulations, we also identify examples where $k$-medoids clustering is unable to find the true clusters, but SLINK is exponentially consistent. Then, we propose a sequential clustering algorithm, named SLINK-SEQ, based on SLINK and prove that it is also exponentially consistent. Simulation results show that the SLINK-SEQ algorithm requires fewer expected number of samples than the FSS SLINK algorithm for the same probability of error.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes