DS · Apr 6, 2023
Parameterized Approximation Schemes for Clustering with General Norm Objectives
Fateme Abbasi, Sandip Banerjee, Jarosław Byrka et al.
This paper considers the well-studied algorithmic regime of designing a $(1+ε)$-approximation algorithm for a $k$-clustering problem that runs in time $f(k,ε)\cdot \mathrm{poly}(n)$ (sometimes called an efficient parameterized approximation scheme, or EPAS for short). Notable results of this kind include EPASes in the high-dimensional Euclidean setting for $k$-center [Bădoiu, Har-Peled, Indyk; STOC'02] as well as $k$-median and $k$-means [Kumar, Sabharwal, Sen; J. ACM 2010]. However, existing EPASes handle only basic objectives (such as $k$-center, $k$-median, and $k$-means) and are tailored to the specific objective and metric space. Our main contribution is a clean and simple EPAS that settles more than ten clustering problems (across multiple well-studied objectives as well as metric spaces) and unifies well-known EPASes. Our algorithm gives EPASes for a large variety of clustering objectives (for example, $k$-means, $k$-center, $k$-median, priority $k$-center, $\ell$-centrum, ordered $k$-median, socially fair $k$-median (also known as robust $k$-median), or more generally monotone norm $k$-clustering) and metric spaces (for example, continuous high-dimensional Euclidean spaces, metrics of bounded doubling dimension, bounded treewidth metrics, and planar metrics). Key to our approach is a new concept that we call bounded $ε$-scatter dimension, an intrinsic complexity measure of a metric space that is a relaxation of the standard notion of bounded doubling dimension. Our main technical result shows that two conditions are essentially sufficient for our algorithm to yield an EPAS on the input metric $M$ for any clustering objective: (i) the objective is described by a monotone (not necessarily symmetric!) norm, and (ii) the $ε$-scatter dimension of $M$ is upper bounded by a function of $ε$.
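To make the phrase "monotone norm $k$-clustering" concrete, the following minimal sketch evaluates a few of the listed objectives as functions of the same vector of point-to-nearest-center distances. It only evaluates costs for a fixed set of centers and is not the paper's algorithm; the function names, the Euclidean metric, and the toy data are assumptions made for illustration.

```python
import math

def distance_vector(points, centers):
    """Distance from each point to its nearest center (Euclidean, assumed here)."""
    return [min(math.dist(p, c) for c in centers) for p in points]

def k_median_cost(d):
    """k-median: the l_1 norm of the distance vector."""
    return sum(d)

def k_center_cost(d):
    """k-center: the l_infinity norm of the distance vector."""
    return max(d)

def ordered_k_median_cost(d, weights):
    """Ordered k-median: non-increasing weights applied to the distances
    sorted in non-increasing order. Weights (1,...,1,0,...,0) with l ones
    give the l-centrum objective (sum of the l largest distances)."""
    return sum(w * x for w, x in zip(weights, sorted(d, reverse=True)))

# Hypothetical toy instance on a line embedded in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (6.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]
d = distance_vector(points, centers)          # [0.0, 1.0, 0.0, 2.0]

print(k_median_cost(d))                       # 3.0
print(k_center_cost(d))                       # 2.0
print(ordered_k_median_cost(d, [1, 1, 0, 0])) # 3.0 (2-centrum)
```

All three costs are norms of the same vector `d`, which is the sense in which a single norm-oblivious framework can cover them.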
51.8 · DS · Apr 24
FPT Approximations for Fair Sum of Radii with Outliers and General Norm Objectives
Ameet Gadekar
The sum of radii problem is a classical clustering problem in which, given a set $X$ of points and an integer $k$, the goal is to place $k$ balls that cover $X$ while minimizing the sum of their radii. Recent work has focused on incorporating modern constraints such as fairness and robustness, motivated by biased and noisy data. We study the fair sum of radii with outliers problem, where the chosen centers must satisfy group-based representation constraints while allowing up to $z$ points to be excluded. We present a $(3+ε)$-approximation algorithm that runs in fixed-parameter tractable time parameterized by $k$. Our framework extends to the more general setting where the objective is a monotone symmetric norm of the radii, achieving a $(3+ε)$-approximation for any fixed norm; this guarantee is tight under Gap-ETH. Moreover, the algorithm is oblivious to the choice of norm: it outputs a small list of candidate solutions such that, for every monotone symmetric norm $f$, the list contains a $(3+ε)$-approximate solution under $f$. Our approach is based on a novel iterative ball-finding framework that uncovers a structural trichotomy in the optimal clustering, enabling us to directly construct fair solutions while handling outliers. Finally, we extend our techniques to the more general fair-range setting, where each group is subject to both lower and upper bounds.
DS · Dec 17, 2025
Label-consistent clustering for evolving data
Ameet Gadekar, Aristides Gionis, Thibault Marette
Data analysis often involves an iterative process, where solutions must be continuously refined in response to new data. Typically, as new data becomes available, an existing solution must be updated to incorporate the latest information. In addition to seeking a high-quality solution for the task at hand, it is also crucial to ensure consistency by minimizing drastic changes from previous solutions. Applying this approach across many iterations ensures that the solution evolves gradually and smoothly. In this paper, we study the above problem in the context of clustering, specifically focusing on the $k$-center problem. More precisely, we study the following problem: given a set of points $X$, parameters $k$ and $b$, and a prior clustering solution $H$ for $X$, our goal is to compute a new solution $C$ for $X$, consisting of $k$ centers, which minimizes the clustering cost while introducing at most $b$ changes from $H$. We refer to this problem as label-consistent $k$-center, and we propose two constant-factor approximation algorithms for it. We complement our theoretical findings with an experimental evaluation demonstrating the effectiveness of our methods on real-world datasets.
LG · Oct 16, 2024 · Code
Fair Clustering for Data Summarization: Improved Approximation Algorithms and Complexity Insights
Ameet Gadekar, Aristides Gionis, Suhas Thejaswi
Data summarization tasks are often modeled as $k$-clustering problems, where the goal is to choose $k$ data points, called cluster centers, that best represent the dataset by minimizing a clustering objective. A popular objective is to minimize the maximum distance between any data point and its nearest center, which is formalized as the $k$-center problem. While in some applications all data points can be chosen as centers, in the general setting, centers must be chosen from a predefined subset of points, referred to as facilities or suppliers; this is known as the $k$-supplier problem. In this work, we focus on fair data summarization modeled as the fair $k$-supplier problem, where data consists of several groups, and a minimum number of centers must be selected from each group while minimizing the $k$-supplier objective. The groups can be disjoint or overlapping, leading to two distinct problem variants, each with different computational complexity. We present $3$-approximation algorithms for both variants, improving the previously known factor of $5$. For disjoint groups, our algorithm runs in polynomial time, while for overlapping groups, we present a fixed-parameter tractable algorithm, where the exponential runtime depends only on the number of groups and centers. We show that these approximation factors match the theoretical lower bounds, assuming standard complexity theory conjectures. Finally, using an open-source implementation, we demonstrate the scalability of our algorithms on large synthetic datasets and assess the price of fairness on real-world data, comparing solution quality with and without fairness constraints.
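As an illustration of the problem definition only (not of the $3$-approximation algorithms), the sketch below evaluates the $k$-supplier objective for a chosen set of facilities and checks per-group minimum requirements in the disjoint-group case. The function names, the Euclidean metric, and the toy instance are assumptions and are not taken from the paper's open-source implementation.

```python
import math
from collections import Counter

def k_supplier_cost(clients, chosen):
    """Maximum over clients of the distance to the nearest chosen facility."""
    return max(min(math.dist(c, f) for f in chosen) for c in clients)

def satisfies_group_minimums(chosen, group_of, minimums):
    """Check that each group contributes at least its required number of
    centers (disjoint groups: every facility has exactly one group)."""
    counts = Counter(group_of[f] for f in chosen)
    return all(counts[g] >= r for g, r in minimums.items())

# Hypothetical toy instance: three clients, three facilities in groups 'a'/'b'.
clients = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
group_of = {(1.0, 0.0): "a", (3.0, 0.0): "a", (9.0, 0.0): "b"}
minimums = {"a": 1, "b": 1}

fair_choice = [(1.0, 0.0), (9.0, 0.0)]
unfair_choice = [(1.0, 0.0), (3.0, 0.0)]

print(k_supplier_cost(clients, fair_choice))                        # 1.0
print(satisfies_group_minimums(fair_choice, group_of, minimums))    # True
print(k_supplier_cost(clients, unfair_choice))                      # 7.0
print(satisfies_group_minimums(unfair_choice, group_of, minimums))  # False
```

Comparing the best cost with and without the `minimums` constraint on an instance is one way to read the "price of fairness" mentioned in the abstract.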
49.9 · DS · May 7
On the Parameterized Approximability of (Mergeable) Sum of Radii Clustering
Ameet Gadekar
The sum of radii problem ($k$-MSR) asks, given a metric space on $n$ points, to place $k$ balls covering all points so as to minimize the sum of their radii. Despite extensive study from the perspectives of approximation and parameterized algorithms, the exact parameterized complexity of the problem and the existence of efficient parameterized approximation schemes remained open. We advance this understanding on both the hardness and algorithmic fronts. We begin by showing that $k$-MSR is $W[2]$-hard parameterized by $k$, thereby pinpointing its location in the $W$-hierarchy. Moreover, via our reduction, we rule out efficient parameterized approximation schemes (EPAS)--that is, $(1+ε)$-approximations running in time $f(k,ε)\cdot \mathrm{poly}(n)$--unless $W[2] = FPT$. Assuming the Exponential Time Hypothesis, we further rule out such algorithms running in time $f(k,ε)\cdot n^{o(k)}$, strengthening recent lower bounds for the problem. On the algorithmic side, we study $k$-MSR under the framework of mergeable constraints, which captures a broad class of clustering constraints, including fairness, diversity, and lower bounds. We obtain an FPT $(\frac{8}{3}+ε)$-approximation, improving upon the previous best guarantee of $(4+ε)$. Moreover, given access to a suitable assignment subroutine, we achieve a $(2+ε)$-approximation, matching the best known bound for the unconstrained problem. This, in turn, yields $(2+ε)$ FPT-approximations for several important settings, including $(t,k)$-fair, $(α,β)$-fair, $\ell$-diversity, and private clustering.
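The $k$-MSR objective itself is easy to state in code. The sketch below assumes Euclidean points and a solution given as (center, radius) pairs; it only verifies feasibility and evaluates the cost, and does not implement any of the approximation algorithms described above. All names and data are hypothetical.

```python
import math

def covers_all(points, balls):
    """True if every point lies in at least one ball (center, radius)."""
    return all(any(math.dist(p, c) <= r for c, r in balls) for p in points)

def sum_of_radii(balls):
    """The k-MSR cost of a solution: the sum of the ball radii."""
    return sum(r for _, r in balls)

# Hypothetical toy instance on a line embedded in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]

two_balls = [((0.0, 0.0), 1.0), ((5.0, 0.0), 0.0)]  # k = 2
one_ball = [((1.0, 0.0), 4.0)]                      # k = 1
too_small = [((1.0, 0.0), 3.0)]                     # misses (5, 0)

print(covers_all(points, two_balls), sum_of_radii(two_balls))  # True 1.0
print(covers_all(points, one_ball), sum_of_radii(one_ball))    # True 4.0
print(covers_all(points, too_small))                           # False
```

Note that a ball of radius zero is allowed, so an isolated point can be covered at no cost when enough balls are available.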
DS · Jan 10, 2024
Diversity-aware clustering: Computational Complexity and Approximation Algorithms
Suhas Thejaswi, Ameet Gadekar, Bruno Ordozgoiti et al.
In this work, we study diversity-aware clustering problems where the data points are associated with multiple attributes, resulting in intersecting groups. A clustering solution must ensure that the number of chosen cluster centers from each group lies within the range defined by a lower and an upper bound threshold for that group, while simultaneously minimizing the clustering objective, which can be either $k$-median, $k$-means or $k$-supplier. We study the computational complexity of the proposed problems, offering insights into their NP-hardness, polynomial-time inapproximability, and fixed-parameter intractability. We present parameterized approximation algorithms with approximation ratios $1+ \frac{2}{e} + ε\approx 1.736$, $1+\frac{8}{e} + ε\approx 3.943$, and $5$ for diversity-aware $k$-median, diversity-aware $k$-means and diversity-aware $k$-supplier, respectively. Assuming Gap-ETH, the approximation ratios are tight for the diversity-aware $k$-median and diversity-aware $k$-means problems. Our results imply the same approximation factors for their respective fair variants with disjoint groups -- fair $k$-median, fair $k$-means, and fair $k$-supplier -- with lower bound requirements.
DS · May 12, 2023
Parameterized Approximation for Robust Clustering in Discrete Geometric Spaces
Fateme Abbasi, Sandip Banerjee, Jarosław Byrka et al.
We consider the well-studied Robust $(k, z)$-Clustering problem, which generalizes the classic $k$-Median, $k$-Means, and $k$-Center problems. Given a constant $z\ge 1$, the input to Robust $(k, z)$-Clustering is a set $P$ of $n$ weighted points in a metric space $(M,δ)$ and a positive integer $k$. Further, each point belongs to one (or more) of $m$ different groups $S_1,S_2,\ldots,S_m$. Our goal is to find a set $X$ of $k$ centers such that $\max_{i \in [m]} \sum_{p \in S_i} w(p) δ(p,X)^z$ is minimized. This problem arises in the domains of robust optimization [Anthony, Goyal, Gupta, Nagarajan, Math. Oper. Res. 2010] and in algorithmic fairness. For polynomial time computation, an approximation factor of $O(\log m/\log\log m)$ is known [Makarychev, Vakilian, COLT 2021], which is tight under a plausible complexity assumption even on line metrics. For FPT time, there is a $(3^z+ε)$-approximation algorithm, which is tight under Gap-ETH [Goyal, Jaiswal, Inf. Proc. Letters, 2023]. Motivated by the tight lower bounds for general discrete metrics, we focus on \emph{geometric} spaces such as the (discrete) high-dimensional Euclidean setting and metrics of low doubling dimension, which play an important role in data analysis applications. First, for a universal constant $η_0 >0.0006$, we devise a $3^z(1-η_{0})$-factor FPT approximation algorithm for discrete high-dimensional Euclidean spaces, thereby bypassing the lower bound for general metrics. We complement this result by showing that even the special case of $k$-Center in dimension $Θ(\log n)$ is $(\sqrt{3/2}- o(1))$-hard to approximate for FPT algorithms. Finally, we complete the FPT approximation landscape by designing an FPT $(1+ε)$-approximation scheme (EPAS) for metrics of sub-logarithmic doubling dimension.
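The objective $\max_{i \in [m]} \sum_{p \in S_i} w(p) δ(p,X)^z$ can be evaluated directly for a candidate center set. The following is a minimal sketch assuming a Euclidean metric, points represented as tuples, and weights stored in a dictionary; these representation choices and all names are assumptions for illustration, and the code evaluates the cost only.

```python
import math

def robust_kz_cost(groups, weights, centers, z):
    """max over groups S_i of sum_{p in S_i} w(p) * dist(p, X)^z,
    where dist(p, X) is the distance from p to its nearest center."""
    def dist_to_centers(p):
        return min(math.dist(p, c) for c in centers)
    return max(
        sum(weights[p] * dist_to_centers(p) ** z for p in group)
        for group in groups
    )

# Hypothetical toy instance: the point (1, 0) belongs to both groups.
a, b, c = (0.0, 0.0), (1.0, 0.0), (5.0, 0.0)
weights = {a: 1.0, b: 1.0, c: 1.0}
groups = [[a, b], [b, c]]
centers = [a]

print(robust_kz_cost(groups, weights, centers, z=1))  # 6.0  (group 2: 1 + 5)
print(robust_kz_cost(groups, weights, centers, z=2))  # 26.0 (group 2: 1 + 25)
```

With a single group containing all points, the same function reduces to the $k$-Median cost for $z=1$ and the $k$-Means cost for $z=2$.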
DS · Feb 18, 2015
On learning k-parities with and without noise
Arnab Bhattacharyya, Ameet Gadekar, Ninad Rajgopal
We first consider the problem of learning $k$-parities in the on-line mistake-bound model: given a hidden vector $x \in \{0,1\}^n$ with $|x|=k$ and a sequence of "questions" $a_1, a_2, \ldots \in \{0,1\}^n$, where the algorithm must reply to each question with $\langle a_i, x \rangle \pmod 2$, what is the best tradeoff between the number of mistakes made by the algorithm and its time complexity? We improve the previous best result of Buhrman et al. by an $\exp(k)$ factor in the time complexity. Second, we consider the problem of learning $k$-parities in the presence of classification noise of rate $η\in (0,1/2)$. A polynomial time algorithm for this problem (when $η> 0$ and $k = ω(1)$) is a longstanding challenge in learning theory. Grigorescu et al. showed an algorithm running in time ${n \choose k/2}^{1 + 4η^2 +o(1)}$. Note that this algorithm inherently requires time ${n \choose k/2}$ even when the noise rate $η$ is polynomially small. We observe that for sufficiently small noise rate, it is possible to break the ${n \choose k/2}$ barrier. In particular, if for some function $f(n) = ω(1)$ and $α\in [1/2, 1)$, $k = n/f(n)$ and $η= o(f(n)^{- α}/\log n)$, then there is an algorithm for the problem with running time $\mathrm{poly}(n)\cdot {n \choose k}^{1-α} \cdot e^{-k/4.01}$.
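To fix notation, the label of a question $a$ under a hidden $k$-parity $x$ is the inner product of $a$ and $x$ modulo 2, and classification noise of rate $η$ flips that label independently with probability $η$. The sketch below shows only this data model, not any of the learning algorithms; the function names and the use of a seeded `random.Random` are assumptions for illustration.

```python
import random

def parity_label(a, x):
    """Noise-free label: inner product of 0/1 vectors a and x, modulo 2."""
    return sum(ai & xi for ai, xi in zip(a, x)) % 2

def noisy_parity_label(a, x, eta, rng):
    """Label with classification noise: flipped with probability eta."""
    label = parity_label(a, x)
    return label ^ 1 if rng.random() < eta else label

# Hypothetical hidden 2-parity on n = 4 bits (|x| = k = 2).
x = [1, 0, 1, 0]

print(parity_label([1, 1, 0, 0], x))  # 1: only position 0 overlaps x
print(parity_label([1, 0, 1, 1], x))  # 0: positions 0 and 2 both overlap

rng = random.Random(0)
# With eta = 0 the noisy label always equals the true label.
print(noisy_parity_label([1, 1, 0, 0], x, 0.0, rng))  # 1
```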