A Distribution Testing Approach to Clustering Distributions
This work provides theoretical bounds on the sample complexity for clustering distributions, which is important for researchers working on distribution testing and clustering algorithms.
This paper addresses the problem of clustering $k$ distributions into two groups, where distributions within each group are identical and the two group distributions are $\\varepsilon$-far in total variation. The authors establish tight upper and lower bounds on the sample complexity for both cases where one or both cluster distributions are unknown, characterizing dependence on domain size $n$, number of distributions $k$, cluster size $r$, and distance $\\varepsilon$ up to an $O(\\log k)$ factor.
We study the following distribution clustering problem: Given a hidden partition of $k$ distributions into two groups, such that the distributions within each group are the same, and the two distributions associated with the two clusters are $\varepsilon$-far in total variation, the goal is to recover the partition. We establish upper and lower bounds on the sample complexity for two fundamental cases: (1) when one of the cluster's distributions is known, and (2) when both are unknown. Our upper and lower bounds characterize the sample complexity's dependence on the domain size $n$, number of distributions $k$, size $r$ of one of the clusters, and distance $\varepsilon$. In particular, we achieve tightness with respect to $(n,k,r,\varepsilon)$ (up to an $O(\log k)$ factor) for all regimes.