CCApr 27, 2023
A Parameterized Theory of PAC LearningCornelius Brand, Robert Ganian, Kirill Simonov
Probably Approximately Correct (i.e., PAC) learning is a core concept of sample complexity theory, and efficient PAC learnability is often seen as a natural counterpart to the class P in classical computational complexity. But while the nascent theory of parameterized complexity has allowed us to push beyond the P-NP ``dichotomy'' in classical computational complexity and identify the exact boundaries of tractability for numerous problems, there is no analogue in the domain of sample complexity that could push beyond efficient PAC learnability. As our core contribution, we fill this gap by developing a theory of parameterized PAC learning which allows us to shed new light on several recent PAC learning results that incorporated elements of parameterized complexity. Within the theory, we identify not one but two notions of fixed-parameter learnability that both form distinct counterparts to the class FPT -- the core concept at the center of the parameterized complexity paradigm -- and develop the machinery required to exclude fixed-parameter learnability. We then showcase the applications of this theory to identify refined boundaries of tractability for CNF and DNF learning as well as for a range of learning problems on graphs.
20.1CGMar 25
Line Cover and Related ProblemsMatthias Bentert, Fedor v. Fomin, Petr A. Golovach et al.
We study extensions of the classic \emph{Line Cover} problem, which asks whether a set of $n$ points in the plane can be covered using $k$ lines. Line Cover is known to be NP-hard, and we focus on two natural generalizations. The first is \textbf{Line Clustering}, where the goal is to find $k$ lines minimizing the sum of squared distances from the input points to their nearest line. The second is \textbf{Hyperplane Cover}, which asks whether $n$ points in $\mathbb{R}^d$ can be covered by $k$ hyperplanes. We also study the more general \textbf{Projective Clustering} problem, which unifies both settings and has applications in machine learning, data analysis, and computational geometry. In this problem, one seeks $k$ affine subspaces of dimension $r$ that minimize the sum of squared distances from the given points in $\mathbb{R}^d$ to the nearest subspace. Our results reveal notable differences in the parameterized complexity of these problems. While Line Cover is fixed-parameter tractable when parameterized by $k$, we show that Line Clustering is W[1]-hard with respect to $k$ and does not admit an algorithm with running time $n^{o(k)}$ unless the Exponential Time Hypothesis fails. Hyperplane Cover has been known to be NP-hard since the 1980s, following work of Megiddo and Tamir, even for $d=2$, we show that it remains NP-hard even when $k=2$. Finally, we present an algorithm for Projective Clustering running in $n^{O(dk(r+1))}$ time. This bound matches our lower bound for Line Clustering and generalizes the classic algorithm for $k$-Means Clustering ($r=0$) by Inaba, Katoh, and Imai [SoCG 1994].
CCSep 12, 2025
Parameterized Complexity of Vehicle RoutingMichelle Döring, Jan Fehse, Tobias Friedrich et al.
The Vehicle Routing Problem (VRP) is a popular generalization of the Traveling Salesperson Problem. Instead of one salesperson traversing the entire weighted, undirected graph $G$, there are $k$ vehicles available to jointly cover the set of clients $C \subseteq V(G)$. Every vehicle must start at one of the depot vertices $D \subseteq V(G)$ and return to its start. Capacitated Vehicle Routing (CVRP) additionally restricts the route of each vehicle by limiting the number of clients it can cover, the distance it can travel, or both. In this work, we study the complexity of VRP and the three variants of CVRP for several parameterizations, in particular focusing on the treewidth of $G$. We present an FPT algorithm for VRP parameterized by treewidth. For CVRP, we prove paraNP- and $W[\cdot]$-hardness for various parameterizations, including treewidth, thereby rendering the existence of FPT algorithms unlikely. In turn, we provide an XP algorithm for CVRP when parameterized by both treewidth and the vehicle capacity.
57.8DSMar 22
Finding Minimum Distance Preservers: A Parameterized StudyKirill Simonov, Farehe Soheil, Shaily Verma
For a given graph $G$ and a subset of vertices $S$, a \emph{distance preserver} is a subgraph of $G$ that preserves shortest paths between the vertices of $S$. We distinguish between a \emph{subsetwise} distance preserver, which preserves distances between all pairs in $S$, and a \emph{pairwise} distance preserver, which preserves distances only between specific pairs of vertices in $S$, given in the input. While a large body of work is dedicated to upper and lower bounds on the size of distance preservers and, more generally, graph spanners, the computational complexity of finding the minimum distance preserver has received comparatively little attention. We consider the respective \scup{Subsetwise Distance Preserver}\xspace (\scup{SDP}\xspace) and \scup{Pairwise Distance Preserver}\xspace (\scup{PDP}\xspace) problems and initiate the study of their computational complexity. We provide a detailed complexity landscape with respect to natural parameters, including the number of terminals, solution size, vertex cover, and treewidth. Our main contributions are as follows: \begin{itemize} \setlength{\itemsep}{0.5em} \item Both \scup{PDP}\xspace and \scup{SDP}\xspace are \nph\ even on subgraphs of the grid. Moreover, when parameterized by the number of terminals, the problems are \wh{1}\ on subgraphs of the grid, while they become \textsc{FPT}\ on full grids. \item \scup{PDP}\xspace is \nph\ on graphs of vertex cover $3$, while \scup{SDP}\xspace is \textsc{FPT}\ when parameterized by the vertex cover of the graph. Thus, the vertex cover parameter distinguishes the two variants. \item Both problems are \textsc{FPT}\ when parameterized by the number of terminals and the treewidth of the graph. \end{itemize}
DSDec 13, 2021
How to Find a Good Explanation for Clustering?Sayan Bandyapadhyay, Fedor V. Fomin, Petr A. Golovach et al.
$k$-means and $k$-median clustering are powerful unsupervised machine learning techniques. However, due to complicated dependences on all the features, it is challenging to interpret the resulting cluster assignments. Moshkovitz, Dasgupta, Rashtchian, and Frost [ICML 2020] proposed an elegant model of explainable $k$-means and $k$-median clustering. In this model, a decision tree with $k$ leaves provides a straightforward characterization of the data set into clusters. We study two natural algorithmic questions about explainable clustering. (1) For a given clustering, how to find the "best explanation" by using a decision tree with $k$ leaves? (2) For a given set of points, how to find a decision tree with $k$ leaves minimizing the $k$-means/median objective of the resulting explainable clustering? To address the first question, we introduce a new model of explainable clustering. Our model, inspired by the notion of outliers in robust statistics, is the following. We are seeking a small number of points (outliers) whose removal makes the existing clustering well-explainable. For addressing the second question, we initiate the study of the model of Moshkovitz et al. from the perspective of multivariate complexity. Our rigorous algorithmic analysis sheds some light on the influence of parameters like the input size, dimension of the data, the number of outliers, the number of clusters, and the approximation ratio, on the computational complexity of explainable clustering.
DSOct 19, 2020
EPTAS for $k$-means Clustering of Affine SubspacesEduard Eiben, Fedor V. Fomin, Petr A. Golovach et al.
We consider a generalization of the fundamental $k$-means clustering for data with incomplete or corrupted entries. When data objects are represented by points in $\mathbb{R}^d$, a data point is said to be incomplete when some of its entries are missing or unspecified. An incomplete data point with at most $Δ$ unspecified entries corresponds to an axis-parallel affine subspace of dimension at most $Δ$, called a $Δ$-point. Thus we seek a partition of $n$ input $Δ$-points into $k$ clusters minimizing the $k$-means objective. For $Δ=0$, when all coordinates of each point are specified, this is the usual $k$-means clustering. We give an algorithm that finds an $(1+ ε)$-approximate solution in time $f(k,ε, Δ) \cdot n^2 \cdot d$ for some function $f$ of $k,ε$, and $Δ$ only.
DSJul 20, 2020
On Coresets for Fair Clustering in Metric and Euclidean Spaces and Their ApplicationsSayan Bandyapadhyay, Fedor V. Fomin, Kirill Simonov
Fair clustering is a constrained variant of clustering where the goal is to partition a set of colored points, such that the fraction of points of any color in every cluster is more or less equal to the fraction of points of this color in the dataset. This variant was recently introduced by Chierichetti et al. [NeurIPS, 2017] in a seminal work and became widely popular in the clustering literature. In this paper, we propose a new construction of coresets for fair clustering based on random sampling. The new construction allows us to obtain the first coreset for fair clustering in general metric spaces. For Euclidean spaces, we obtain the first coreset whose size does not depend exponentially on the dimension. Our coreset results solve open questions proposed by Schmidt et al. [WAOA, 2019] and Huang et al. [NeurIPS, 2019]. The new coreset construction helps to design several new approximation and streaming algorithms. In particular, we obtain the first true constant-approximation algorithm for metric fair clustering, whose running time is fixed-parameter tractable (FPT). In the Euclidean case, we derive the first $(1+ε)$-approximation algorithm for fair clustering whose time complexity is near-linear and does not depend exponentially on the dimension of the space. Besides, our coreset construction scheme is fairly general and gives rise to coresets for a wide range of constrained clustering problems. This leads to improved constant-approximations for these problems in general metrics and near-linear time $(1+ε)$-approximations in the Euclidean metric.
DSMay 10, 2019
Refined Complexity of PCA with OutliersFedor V. Fomin, Petr A. Golovach, Fahad Panolan et al.
Principal component analysis (PCA) is one of the most fundamental procedures in exploratory data analysis and is the basic step in applications ranging from quantitative finance and bioinformatics to image analysis and neuroscience. However, it is well-documented that the applicability of PCA in many real scenarios could be constrained by an "immune deficiency" to outliers such as corrupted observations. We consider the following algorithmic question about the PCA with outliers. For a set of $n$ points in $\mathbb{R}^{d}$, how to learn a subset of points, say 1% of the total number of points, such that the remaining part of the points is best fit into some unknown $r$-dimensional subspace? We provide a rigorous algorithmic analysis of the problem. We show that the problem is solvable in time $n^{O(d^2)}$. In particular, for constant dimension the problem is solvable in polynomial time. We complement the algorithmic result by the lower bound, showing that unless Exponential Time Hypothesis fails, in time $f(d)n^{o(d)}$, for any function $f$ of $d$, it is impossible not only to solve the problem exactly but even to approximate it within a constant factor.