LG DS MLNov 2, 2024

Relax and Merge: A Simple Yet Effective Framework for Solving Fair $k$-Means and $k$-sparse Wasserstein Barycenter Problems

Shihong Song, Guanlin Mo, Qingyuan Yang, Hu Ding

arXiv:2411.01115v32.6h-index: 3

Originality Highly original

AI Analysis

This addresses fairness in clustering for machine learning applications, offering improved theoretical guarantees and empirical performance over existing methods.

The paper tackles the fair k-means clustering problem by proposing a 'Relax and Merge' framework that achieves a (5+O(ε))-approximation ratio with slight fairness constraint violations, improving the state-of-the-art, and also applies to the k-sparse Wasserstein Barycenter problem with better approximations.

The fairness of clustering algorithms has gained widespread attention across various areas, including machine learning, In this paper, we study fair $k$-means clustering in Euclidean space. Given a dataset comprising several groups, the fairness constraint requires that each cluster should contain a proportion of points from each group within specified lower and upper bounds. Due to these fairness constraints, determining the optimal locations of $k$ centers is a quite challenging task. We propose a novel ``Relax and Merge'' framework that returns a $(1+4ρ+ O(ε))$-approximate solution, where $ρ$ is the approximate ratio of an off-the-shelf vanilla $k$-means algorithm and $O(ε)$ can be an arbitrarily small positive number. If equipped with a PTAS of $k$-means, our solution can achieve an approximation ratio of $(5+O(ε))$ with only a slight violation of the fairness constraints, which improves the current state-of-the-art approximation guarantee. Furthermore, using our framework, we can also obtain a $(1+4ρ+O(ε))$-approximate solution for the $k$-sparse Wasserstein Barycenter problem, which is a fundamental optimization problem in the field of optimal transport, and a $(2+6ρ)$-approximate solution for the strictly fair $k$-means clustering with no violation, both of which are better than the current state-of-the-art methods. In addition, the empirical results demonstrate that our proposed algorithm can significantly outperform baseline approaches in terms of clustering cost.

View on arXiv PDF

Similar