CG LGOct 2, 2018

A Unified Framework for Clustering Constrained Data without Locality Property

arXiv:1810.01049v111.715 citations

Originality Highly original

AI Analysis

This addresses high-dimensional constrained clustering problems for data analysis applications, representing a novel method for a known bottleneck rather than an incremental improvement.

The paper tackles constrained clustering problems where optimal solutions lack the locality property, presenting a unified Peeling-and-Enclosing (PnE) framework that achieves (1+ε)-approximation for constrained k-means and k-median clustering in nearly linear time O(n(log n)^{k+1}d) when k and 1/ε are fixed.

In this paper, we consider a class of constrained clustering problems of points in $\mathbb{R}^{d}$, where $d$ could be rather high. A common feature of these problems is that their optimal clusterings no longer have the locality property (due to the additional constraints), which is a key property required by many algorithms for their unconstrained counterparts. To overcome the difficulty caused by the loss of locality, we present in this paper a unified framework, called {\em Peeling-and-Enclosing (PnE)}, to iteratively solve two variants of the constrained clustering problems, {\em constrained $k$-means clustering} ($k$-CMeans) and {\em constrained $k$-median clustering} ($k$-CMedian). Our framework is based on two standalone geometric techniques, called {\em Simplex Lemma} and {\em Weaker Simplex Lemma}, for $k$-CMeans and $k$-CMedian, respectively. The simplex lemma (or weaker simplex lemma) enables us to efficiently approximate the mean (or median) point of an unknown set of points by searching a small-size grid, independent of the dimensionality of the space, in a simplex (or the surrounding region of a simplex), and thus can be used to handle high dimensional data. If $k$ and $\frac{1}ε$ are fixed numbers, our framework generates, in nearly linear time ({\em i.e.,} $O(n(\log n)^{k+1}d)$), $O((\log n)^{k})$ $k$-tuple candidates for the $k$ mean or median points, and one of them induces a $(1+ε)$-approximation for $k$-CMeans or $k$-CMedian, where $n$ is the number of points. Combining this unified framework with a problem-specific selection algorithm (which determines the best $k$-tuple candidate), we obtain a $(1+ε)$-approximation for each of the constrained clustering problems. We expect that our technique will be applicable to other constrained clustering problems without locality.

View on arXiv PDF

Similar