EPTAS for $k$-means Clustering of Affine Subspaces
This addresses clustering challenges in domains with incomplete data, such as data analysis or machine learning, but is incremental as it extends existing k-means methods to handle missing entries.
The paper tackles the problem of k-means clustering for incomplete or corrupted data by generalizing it to handle data points with missing entries, which are modeled as affine subspaces. It presents an algorithm that finds a (1+ε)-approximate solution in time f(k,ε,Δ)·n²·d, where f depends only on k, ε, and Δ.
We consider a generalization of the fundamental $k$-means clustering for data with incomplete or corrupted entries. When data objects are represented by points in $\mathbb{R}^d$, a data point is said to be incomplete when some of its entries are missing or unspecified. An incomplete data point with at most $Δ$ unspecified entries corresponds to an axis-parallel affine subspace of dimension at most $Δ$, called a $Δ$-point. Thus we seek a partition of $n$ input $Δ$-points into $k$ clusters minimizing the $k$-means objective. For $Δ=0$, when all coordinates of each point are specified, this is the usual $k$-means clustering. We give an algorithm that finds an $(1+ ε)$-approximate solution in time $f(k,ε, Δ) \cdot n^2 \cdot d$ for some function $f$ of $k,ε$, and $Δ$ only.