Shaofeng H. -C. Jiang

DS
6papers
124citations
Novelty65%
AI Score51

6 Papers

QUANT-PHJun 5, 2023
Near-Optimal Quantum Coreset Construction Algorithms for Clustering

Yecheng Xue, Xiaoyu Chen, Tongyang Li et al.

$k$-Clustering in $\mathbb{R}^d$ (e.g., $k$-median and $k$-means) is a fundamental machine learning problem. While near-linear time approximation algorithms were known in the classical setting for a dataset with cardinality $n$, it remains open to find sublinear-time quantum algorithms. We give quantum algorithms that find coresets for $k$-clustering in $\mathbb{R}^d$ with $\tilde{O}(\sqrt{nk}d^{3/2})$ query complexity. Our coreset reduces the input size from $n$ to $\mathrm{poly}(kε^{-1}d)$, so that existing $α$-approximation algorithms for clustering can run on top of it and yield $(1 + ε)α$-approximation. This eventually yields a quadratic speedup for various $k$-clustering approximation algorithms. We complement our algorithm with a nearly matching lower bound, that any quantum algorithm must make $Ω(\sqrt{nk})$ queries in order to achieve even $O(1)$-approximation for $k$-clustering.

LGOct 1, 2022
On The Relative Error of Random Fourier Features for Preserving Kernel Distance

Kuan Cheng, Shaofeng H. -C. Jiang, Luojian Wei et al.

The method of random Fourier features (RFF), proposed in a seminal paper by Rahimi and Recht (NIPS'07), is a powerful technique to find approximate low-dimensional representations of points in (high-dimensional) kernel space, for shift-invariant kernels. While RFF has been analyzed under various notions of error guarantee, the ability to preserve the kernel distance with \emph{relative} error is less understood. We show that for a significant range of kernels, including the well-known Laplacian kernels, RFF cannot approximate the kernel distance with small relative error using low dimensions. We complement this by showing as long as the shift-invariant kernel is analytic, RFF with $\mathrm{poly}(ε^{-1} \log n)$ dimensions achieves $ε$-relative error for pairwise kernel distance of $n$ points, and the dimension bound is improved to $\mathrm{poly}(ε^{-1}\log k)$ for the specific application of kernel $k$-means. Finally, going beyond RFF, we make the first step towards data-oblivious dimension-reduction for general shift-invariant kernels, and we obtain a similar $\mathrm{poly}(ε^{-1} \log n)$ dimension bound for Laplacian kernels. We also validate the dimension-error tradeoff of our methods on simulated datasets, and they demonstrate superior performance compared with other popular methods including random-projection and Nyström methods.

72.3DSApr 30
Streaming Max-Cut in General Metrics

Shaofeng H. -C. Jiang, Pan Peng, Haoze Wang

Max-Cut is a fundamental combinatorial optimization problem that has been studied in various computational settings. We initiate the study of its streaming complexity in \emph{general metric spaces} with access to distance oracles. We give a $(1 + ε)$-approximate algorithm for estimating the Max-Cut value in \emph{sliding-window} streams using only poly-logarithmic space. This is the first sliding-window algorithm for Max-Cut even in Euclidean spaces, and it matches a known insertion-only space bound in the special case of Euclidean spaces [Chen, Jiang, Krauthgamer, STOC'23]. In sharp contrast, we give a $\poly(n)$-space lower bound in the \emph{dynamic} streaming setting. This yields a separation from the Euclidean case, where the polylogarithmic-space $(1+ε)$-approximation extends to dynamic streams. On the technical side, our sliding-window algorithm builds on the smooth histogram framework of [Braverman and Ostrovsky, SICOMP'10]. To make this framework applicable, we establish the first smoothness bound for metric Max-Cut. Moreover, we develop a streaming algorithm for metric Max-Cut in insertion-only streams, whose key ingredient is a new metric reservoir sampling technique.

70.5DSApr 2
Fully Dynamic Euclidean k-Means

Sayan Bhattacharya, Martín Costa, Ermiya Farokhnejad et al.

We consider the Euclidean $k$-means clustering problem in a dynamic setting, where we have to explicitly maintain a solution (a set of $k$ centers) $S \subseteq \mathbb{R}^d$ subject to point insertions/deletions in $\mathbb{R}^d$. We present a dynamic algorithm for Euclidean $k$-means with $\mathrm{poly}(1/ε)$-approximation ratio, $\tilde{O}(k^ε)$ update time, and $\tilde{O}(1)$ recourse, for any $ε\in (0,1)$, even when $d$ and $k$ are both part of the input. This is the first algorithm to achieve a constant ratio with $o(k)$ update time for this problem, whereas the previous $O(1)$-approximation runs in $\tilde O(k)$ update time [Bhattacharya, Costa, Farokhnejad; STOC'25]. In fact, previous algorithms cannot go beyond $O(k)$ update time precisely because they are designed for general metrics where an $Ω(k)$ lower bound is known. We break this $O(k)$ barrier by devising new fundamental data structures to utilize Euclidean properties: a structure that (implicitly) maintains a clustering subject to both center and data point updates, and a range query structure that can evaluate a mergeable function over any metric ball range given as a query. To obtain these structures, we devise the first consistent hashing scheme [Czumaj, Jiang, Krauthgamer, Vesel{ý}, Yang; FOCS'22] that achieves $\tilde O(n^ε)$ running time per point evaluation with competitive parameters. Our final algorithm exploits the framework of [Bhattacharya, Costa, Farokhnejad; STOC'25] for general metrics. The key change is to redesign several critical subroutines so that they reduce to our new Euclidean data structures, replacing the general-metric implementations that are unlikely to run efficiently even when Euclidean properties are provided.

46.5DSApr 1
Round-efficient Fully-scalable MPC algorithms for k-Means

Shaofeng H. -C. Jiang, Yaonan Jin, Jianing Lou et al.

We study Euclidean $k$-Means under the Massively Parallel Computation (MPC) model, focusing on the \emph{fully-scalable} setting. Our main result is a fully-scalable $O((\log n/\log\log n)^2)$-approximation in $O(1)$ rounds. Previously, fully-scalable algorithms for $k$-Means either run in super-constant $O(\log\log n \cdot \log\log\log n)$ rounds, albeit with a better $O(1)$-approximation [Cohen-Addad et al., SODA'26], or suffer from bicriteria guarantees [Bhaskara and Wijewardena, ICML'18; Czumaj et al., ICALP'24]. Our algorithm also gives an $O(\log n/\log\log n)$-approximation for $k$-Median, which improves a recent $O(\log n)$-approximation [Goranci et al., SODA'26], and this $o(\log n)$ ratio breaks the fundamental barrier of tree embedding methods used therein. Our main technical contribution is a new variant of the MP algorithm [Mettu and Plaxton, SICOMP'03] that works for general metrics, whose new guarantee is the Lagrangian Multiplier Preserving (LMP) property, which, importantly, holds even under arbitrary distance distortions. Allowing distance distortion is crucial for efficient MPC implementations and useful for efficient algorithm design in general, whereas preserving the LMP property under distance distortion is known to be a significant technical challenge. As a byproduct of our techniques, we also obtain an $O(1)$-approximation to the optimal \emph{value} in $O(1)$ rounds, which conceptually suggests that achieving a true $O(1)$-approximation (for the solution) in $O(1)$ rounds may be a sensible goal for future study.

DSJun 20, 2019
Coresets for Clustering with Fairness Constraints

Lingxiao Huang, Shaofeng H. -C. Jiang, Nisheeth K. Vishnoi

In a recent work, [19] studied the following "fair" variants of classical clustering problems such as $k$-means and $k$-median: given a set of $n$ data points in $\mathbb{R}^d$ and a binary type associated to each data point, the goal is to cluster the points while ensuring that the proportion of each type in each cluster is roughly the same as its underlying proportion. Subsequent work has focused on either extending this setting to when each data point has multiple, non-disjoint sensitive types such as race and gender [6], or to address the problem that the clustering algorithms in the above work do not scale well. The main contribution of this paper is an approach to clustering with fairness constraints that involve multiple, non-disjoint types, that is also scalable. Our approach is based on novel constructions of coresets: for the $k$-median objective, we construct an $\varepsilon$-coreset of size $O(Γk^2 \varepsilon^{-d})$ where $Γ$ is the number of distinct collections of groups that a point may belong to, and for the $k$-means objective, we show how to construct an $\varepsilon$-coreset of size $O(Γk^3\varepsilon^{-d-1})$. The former result is the first known coreset construction for the fair clustering problem with the $k$-median objective, and the latter result removes the dependence on the size of the full dataset as in [39] and generalizes it to multiple, non-disjoint types. Plugging our coresets into existing algorithms for fair clustering such as [5] results in the fastest algorithms for several cases. Empirically, we assess our approach over the \textbf{Adult}, \textbf{Bank}, \textbf{Diabetes} and \textbf{Athlete} dataset, and show that the coreset sizes are much smaller than the full dataset. We also achieve a speed-up to recent fair clustering algorithms [5,6] by incorporating our coreset construction.