Near-optimal Algorithms for Explainable k-Medians and k-Means
This addresses the need for interpretable clustering in machine learning, offering significant algorithmic improvements over prior work.
The paper tackles the problem of explainable clustering using threshold decision trees for k-medians and k-means, achieving improved competitive ratios of ~O(log k) for k-medians and ~O(k) for k-means, compared to previous O(k) and O(k^2) guarantees, and provides near-optimal lower bounds.
We consider the problem of explainable $k$-medians and $k$-means introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian~(ICML 2020). In this problem, our goal is to find a threshold decision tree that partitions data into $k$ clusters and minimizes the $k$-medians or $k$-means objective. The obtained clustering is easy to interpret because every decision node of a threshold tree splits data based on a single feature into two groups. We propose a new algorithm for this problem which is $\tilde O(\log k)$ competitive with $k$-medians with $\ell_1$ norm and $\tilde O(k)$ competitive with $k$-means. This is an improvement over the previous guarantees of $O(k)$ and $O(k^2)$ by Dasgupta et al (2020). We also provide a new algorithm which is $O(\log^{3/2} k)$ competitive for $k$-medians with $\ell_2$ norm. Our first algorithm is near-optimal: Dasgupta et al (2020) showed a lower bound of $Ω(\log k)$ for $k$-medians; in this work, we prove a lower bound of $\tildeΩ(k)$ for $k$-means. We also provide a lower bound of $Ω(\log k)$ for $k$-medians with $\ell_2$ norm.