Lorenzo Beretta

LG
h-index: 10
5 papers
14 citations
Novelty: 56%
AI Score: 33

5 Papers

CG, Sep 28, 2023
Multi-Swap $k$-Means++

Lorenzo Beretta, Vincent Cohen-Addad, Silvio Lattanzi et al.

The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' algorithm of choice for optimizing the popular $k$-means clustering objective and is known to give an $O(\log k)$-approximation in expectation. To obtain higher-quality solutions, Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps obtained through the $k$-means++ sampling distribution to yield a $c$-approximation to the $k$-means clustering problem, where $c$ is a large absolute constant. Here we generalize and extend their local search algorithm by considering larger and more sophisticated local search neighborhoods, thereby allowing multiple centers to be swapped at the same time. Our algorithm achieves a $9 + \varepsilon$ approximation ratio, which is the best possible for local search. Importantly, we show that our approach yields substantial practical improvements: it achieves significantly higher solution quality than the approach of Lattanzi and Sohler (ICML 2019) on several datasets.
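
The single-swap baseline that this abstract builds on can be sketched as follows. This is a minimal illustration, assuming points are tuples of floats; the function names (`kmeanspp_seed`, `local_search_step`, and the helpers) are hypothetical and not taken from the paper. It shows $k$-means++ seeding followed by one local search step in the style of Lattanzi and Sohler, not the multi-swap algorithm itself.

```python
import random

def sq_dist(p, q):
    # Squared Euclidean distance between two points given as tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cost(points, centers):
    # k-means objective: sum of squared distances to the nearest center.
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def d2_sample(points, centers, rng):
    # Sample a point with probability proportional to its squared
    # distance to the nearest current center (the k-means++ distribution).
    weights = [min(sq_dist(p, c) for c in centers) for p in points]
    if sum(weights) == 0:
        return rng.choice(points)  # all points coincide with centers
    return rng.choices(points, weights=weights, k=1)[0]

def kmeanspp_seed(points, k, rng):
    # First center uniform at random, remaining ones by D^2 sampling.
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(d2_sample(points, centers, rng))
    return centers

def local_search_step(points, centers, rng):
    # Sample one candidate via D^2 sampling and try swapping it with each
    # current center; keep the swap that lowers the cost the most, if any.
    candidate = d2_sample(points, centers, rng)
    best, best_cost = centers, cost(points, centers)
    for i in range(len(centers)):
        trial = centers[:i] + [candidate] + centers[i + 1:]
        trial_cost = cost(points, trial)
        if trial_cost < best_cost:
            best, best_cost = trial, trial_cost
    return best
```

Because a swap is accepted only when it strictly lowers the cost, repeated calls to `local_search_step` never increase the objective. A multi-swap variant would sample several candidates and consider exchanging several centers at once, which this sketch does not attempt.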

DS, Aug 9, 2025
Approximating High-Dimensional Earth Mover's Distance as Fast as Closest Pair

Lorenzo Beretta, Vincent Cohen-Addad, Rajesh Jayaram et al.

We give a reduction from $(1+\varepsilon)$-approximate Earth Mover's Distance (EMD) to $(1+\varepsilon)$-approximate Closest Pair (CP). As a consequence, we improve the fastest known approximation algorithm for high-dimensional EMD. Here, given $p\in [1, 2]$ and two sets of $n$ points $X,Y \subseteq (\mathbb R^d,\ell_p)$, their EMD is the minimum cost of a perfect matching between $X$ and $Y$, where the cost of matching two vectors is their $\ell_p$ distance. Further, CP is the basic problem of finding a pair of points realizing $\min_{x \in X, y\in Y} \|x-y\|_p$. Our contribution is twofold: we show that if a $(1+\varepsilon)$-approximate CP can be computed in time $n^{2-\phi}$, then a $1+O(\varepsilon)$ approximation to EMD can be computed in time $n^{2-\Omega(\phi)}$; plugging in the fastest known algorithm for CP [Alman, Chan, Williams FOCS'16], we obtain a $(1+\varepsilon)$-approximation algorithm for EMD running in time $n^{2-\tilde{\Omega}(\varepsilon^{1/3})}$ for high-dimensional point sets, which improves over the prior fastest running time of $n^{2-\Omega(\varepsilon^2)}$ [Andoni, Zhang FOCS'23]. Our main technical contribution is a sublinear implementation of the Multiplicative Weights Update framework for EMD. Specifically, we demonstrate that the updates can be executed without ever explicitly computing or storing the weights; instead, we exploit the underlying geometric structure to perform the updates implicitly.
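
To make the two problem definitions above concrete, here is a brute-force sketch of exact EMD and exact CP for tiny inputs. It only illustrates the definitions; it takes factorial time and is unrelated to the reduction or the sublinear Multiplicative Weights Update implementation described in the abstract. The function names are hypothetical.

```python
from itertools import permutations

def lp_dist(x, y, p):
    # l_p distance between two vectors given as tuples, for p >= 1.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def closest_pair(X, Y, p=2):
    # Exact bichromatic closest pair distance: min over x in X, y in Y.
    return min(lp_dist(x, y, p) for x in X for y in Y)

def emd_bruteforce(X, Y, p=2):
    # Exact EMD as the minimum cost of a perfect matching between X and Y,
    # found by enumerating all n! matchings (only feasible for tiny n).
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of points")
    return min(
        sum(lp_dist(x, Y[j], p) for x, j in zip(X, perm))
        for perm in permutations(range(len(Y)))
    )
```

Since every matched pair is at least as far apart as the closest pair, `emd_bruteforce(X, Y, p)` is always at least `len(X) * closest_pair(X, Y, p)`, which is a quick sanity check on small examples.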

LG, May 9, 2025
New Statistical and Computational Results for Learning Junta Distributions

Lorenzo Beretta

We study the problem of learning junta distributions on $\{0, 1\}^n$, where a distribution is a $k$-junta if its probability mass function depends on a subset of at most $k$ variables. We make two main contributions:
- We show that learning $k$-junta distributions is computationally equivalent to learning $k$-parity functions with noise (LPN), a landmark problem in computational learning theory.
- We design an algorithm for learning junta distributions whose statistical complexity is optimal, up to polylogarithmic factors. Computationally, our algorithm matches the complexity of previous (non-sample-optimal) algorithms.
Combined, our two contributions imply that our algorithm cannot be significantly improved, statistically or computationally, barring a breakthrough for LPN.

LG, May 7, 2025
Feature Selection and Junta Testing are Statistically Equivalent

Lorenzo Beretta, Nathaniel Harms, Caleb Koch

For a function $f \colon \{0,1\}^n \to \{0,1\}$, the junta testing problem asks whether $f$ depends on only $k$ variables. If $f$ depends on only $k$ variables, the feature selection problem asks to find those variables. We prove that these two tasks are statistically equivalent. Specifically, we show that the "brute-force" algorithm, which checks for any set of $k$ variables consistent with the sample, is simultaneously sample-optimal for both problems, and the optimal sample size is \[ \Theta\left(\frac{1}{\varepsilon} \left( \sqrt{2^k \log \binom{n}{k}} + \log \binom{n}{k}\right)\right). \]
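
The brute-force consistency check mentioned in this abstract might look like the sketch below. It assumes the sample is a list of `(x, y)` pairs with `x` a tuple of bits and `y` a bit; the name `consistent_subsets` is hypothetical, and the abstract does not specify how the paper's algorithm handles the distance parameter $\varepsilon$, so none of that is modeled here.

```python
from itertools import combinations

def consistent_subsets(sample, n, k):
    # Yield every set S of k variable indices such that the labels in the
    # sample can be written as a function of x restricted to S, i.e. no two
    # examples agree on S but carry different labels.
    for S in combinations(range(n), k):
        seen = {}
        ok = True
        for x, y in sample:
            key = tuple(x[i] for i in S)
            # setdefault stores y on first sight and returns the stored label.
            if seen.setdefault(key, y) != y:
                ok = False
                break
        if ok:
            yield S
```

Under this reading, a tester would accept when at least one consistent subset exists, and a feature selector would output such a subset. The enumeration visits all $\binom{n}{k}$ subsets, which matches the $\log \binom{n}{k}$ term in the sample bound only in spirit: the bound concerns samples, not running time.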

IR, Nov 26, 2021
An Optimal Algorithm for Finding Champions in Tournament Graphs

Lorenzo Beretta, Franco Maria Nardini, Roberto Trani et al.

A tournament graph is a complete directed graph, which can be used to model a round-robin tournament between $n$ players. In this paper, we address the problem of finding a champion of the tournament, also known as a Copeland winner, which is a player that wins the highest number of matches. Specifically, we investigate algorithms that find the champion by playing few matches. Solving this problem allows us to speed up several Information Retrieval and Recommender System applications, including question answering and conversational search. Indeed, these applications often search for the champion by inducing a round-robin tournament among the players, employing a machine learning model to estimate who wins each pairwise comparison. Our contribution thus allows finding the champion with a low number of model inferences. We prove that any deterministic or randomized algorithm finding a champion with constant success probability requires $\Omega(\ell n)$ comparisons, where $\ell$ is the number of matches lost by the champion. We then present an asymptotically optimal deterministic algorithm matching this lower bound without knowing $\ell$, and we extend our analysis to three variants of the problem. Lastly, we conduct a comprehensive experimental assessment of the proposed algorithms on a question answering task on public data. Results show that our proposed algorithms speed up the retrieval of the champion by up to $13\times$ with respect to the state-of-the-art algorithm that performs the full tournament.
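
For reference, the full-tournament baseline that the abstract compares against can be sketched as below. This is not the paper's optimal algorithm, whose details the abstract does not give. The callback `beats(i, j)` is a hypothetical stand-in for one pairwise model inference and is assumed to return `True` when player `i` beats player `j`.

```python
def copeland_winner(n, beats):
    # Play every pair once (n * (n - 1) / 2 comparisons) and return a
    # player with the most wins, its number of losses, and the number of
    # comparisons performed. Ties are broken by lowest index.
    wins = [0] * n
    comparisons = 0
    for i in range(n):
        for j in range(i + 1, n):
            comparisons += 1
            if beats(i, j):
                wins[i] += 1
            else:
                wins[j] += 1
    champion = max(range(n), key=wins.__getitem__)
    losses = n - 1 - wins[champion]
    return champion, losses, comparisons
```

The baseline always spends a quadratic number of comparisons, whereas the lower bound in the abstract is $\Omega(\ell n)$, where $\ell$ is the `losses` value returned here. The gap is largest when the champion loses few matches.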