MLJun 5, 2023
Active Ranking of Experts Based on their Performances in Many TasksEl Mehdi Saad, Nicolas Verzelen, Alexandra Carpentier
We consider the problem of ranking n experts based on their performances on d tasks. We make a monotonicity assumption stating that for each pair of experts, one outperforms the other on all tasks. We consider the sequential setting where in each round, the learner has access to noisy evaluations of actively chosen pair of expert-task, given the information available up to the actual round. Given a confidence parameter $δ$ $\in$ (0, 1), we provide strategies allowing to recover the correct ranking of experts and develop a bound on the total number of queries made by our algorithm that hold with probability at least 1 -- $δ$. We show that our strategy is adaptive to the complexity of the problem (our bounds are instance dependent), and develop matching lower bounds up to a poly-logarithmic factor. Finally, we adapt our strategy to the relaxed problem of best expert identification and provide numerical simulation consistent with our theoretical results.
MLJun 5, 2023
Covariance Adaptive Best Arm IdentificationEl Mehdi Saad, Gilles Blanchard, Nicolas Verzelen
We consider the problem of best arm identification in the multi-armed bandit model, under fixed confidence. Given a confidence input $δ$, the goal is to identify the arm with the highest mean reward with a probability of at least 1 -- $δ$, while minimizing the number of arm pulls. While the literature provides solutions to this problem under the assumption of independent arms distributions, we propose a more flexible scenario where arms can be dependent and rewards can be sampled simultaneously. This framework allows the learner to estimate the covariance among the arms distributions, enabling a more efficient identification of the best arm. The relaxed setting we propose is relevant in various applications, such as clinical trials, where similarities between patients or drugs suggest underlying correlations in the outcomes. We introduce new algorithms that adapt to the unknown covariance of the arms and demonstrate through theoretical guarantees that substantial improvement can be achieved over the standard setting. Additionally, we provide new lower bounds for the relaxed setting and present numerical simulations that support their theoretical findings.
MLJan 12
Nonparametric Kernel Clustering with Bandit FeedbackVictor Thuot, Sebastian Vogt, Debarghya Ghoshdastidar et al.
Clustering with bandit feedback refers to the problem of partitioning a set of items, where the clustering algorithm can sequentially query the items to receive noisy observations. The problem is formally posed as the task of partitioning the arms of an N-armed stochastic bandit according to their underlying distributions, grouping two arms together if and only if they share the same distribution, using samples collected sequentially and adaptively. This setting has gained attention in recent years due to its applicability in recommendation systems and crowdsourcing. Existing works on clustering with bandit feedback rely on a strong assumption that the underlying distributions are sub-Gaussian. As a consequence, the existing methods mainly cover settings with linearly-separable clusters, which has little practical relevance. We introduce a framework of ``nonparametric clustering with bandit feedback'', where the underlying arm distributions are not constrained to any parametric, and hence, it is applicable for active clustering of real-world datasets. We adopt a kernel-based approach, which allows us to reformulate the nonparametric problem as the task of clustering the arms according to their kernel mean embeddings in a reproducing kernel Hilbert space (RKHS). Building on this formulation, we introduce the KABC algorithm with theoretical correctness guarantees and analyze its sampling budget. We introduce a notion of signal-to-noise ratio for this problem that depends on the maximum mean discrepancy (MMD) between the arm distributions and on their variance in the RKHS. Our algorithm is adaptive to this unknown quantity: it does not require it as an input yet achieves instance-dependent guarantees.
MLMar 16
The Sampling Complexity of Condorcet Winner Identification in Dueling BanditsEl Mehdi Saad, Victor Thuot, Nicolas Verzelen
We study best-arm identification in stochastic dueling bandits under the sole assumption that a Condorcet winner exists, i.e., an arm that wins each noisy pairwise comparison with probability at least $1/2$. We introduce a new identification procedure that exploits the full gap matrix $Î_{i,j}=q_{i,j}-\tfrac12$ (where $q_{i,j}$ is the probability that arm $i$ beats arm $j$), rather than only the gaps between the Condorcet winner and the other arms. We derive high-probability, instance-dependent sample-complexity guarantees that (up to logarithmic factors) improve the best known ones by leveraging informative comparisons beyond those involving the winner. We complement these results with new lower bounds which, to our knowledge, are the first for Condorcet-winner identification in stochastic dueling bandits. Our lower-bound analysis isolates the intrinsic cost of locating informative entries in the gap matrix and estimating them to the required confidence, establishing the optimality of our non-asymptotic bounds. Overall, our results reveal new regimes and trade-offs in the sample complexity that are not captured by asymptotic analyses based only on the expected budget.
MLAug 27, 2024
Optimal level set estimation for non-parametric tournament and crowdsourcing problemsMaximilian Graf, Alexandra Carpentier, Nicolas Verzelen
Motivated by crowdsourcing, we consider a problem where we partially observe the correctness of the answers of $n$ experts on $d$ questions. In this paper, we assume that both the experts and the questions can be ordered, namely that the matrix $M$ containing the probability that expert $i$ answers correctly to question $j$ is bi-isotonic up to a permutation of it rows and columns. When $n=d$, this also encompasses the strongly stochastic transitive (SST) model from the tournament literature. Here, we focus on the relevant problem of deciphering small entries of $M$ from large entries of $M$, which is key in crowdsourcing for efficient allocation of workers to questions. More precisely, we aim at recovering a (or several) level set $p$ of the matrix up to a precision $h$, namely recovering resp. the sets of positions $(i,j)$ in $M$ such that $M_{ij}>p+h$ and $M_{i,j}<p-h$. We consider, as a loss measure, the number of misclassified entries. As our main result, we construct an efficient polynomial-time algorithm that turns out to be minimax optimal for this classification problem. This heavily contrasts with existing literature in the SST model where, for the stronger reconstruction loss, statistical-computational gaps have been conjectured. More generally, this shades light on the nature of statistical-computational gaps for permutations models.
STFeb 26
Low-degree Lower bounds for clustering in moderate dimensionAlexandra Carpentier, Nicolas Verzelen
We study the fundamental problem of clustering $n$ points into $K$ groups drawn from a mixture of isotropic Gaussians in $\mathbb{R}^d$. Specifically, we investigate the requisite minimal distance $Δ$ between mean vectors to partially recover the underlying partition. While the minimax-optimal threshold for $Δ$ is well-established, a significant gap exists between this information-theoretic limit and the performance of known polynomial-time procedures. Although this gap was recently characterized in the high-dimensional regime ($n \leq dK$), it remains largely unexplored in the moderate-dimensional regime ($n \geq dK$). In this manuscript, we address this regime by establishing a new low-degree polynomial lower bound for the moderate-dimensional case when $d \geq K$. We show that while the difficulty of clustering for $n \leq dK$ is primarily driven by dimension reduction and spectral methods, the moderate-dimensional regime involves more delicate phenomena leading to a "non-parametric rate". We provide a novel non-spectral algorithm matching this rate, shedding new light on the computational limits of the clustering problem in moderate dimension.
MLNov 26, 2025
Phase Transition for Stochastic Block Model with more than $\sqrt{n}$ Communities (II)Alexandra Carpentier, Christophe Giraud, Nicolas Verzelen
A fundamental theoretical question in network analysis is to determine under which conditions community recovery is possible in polynomial time in the Stochastic Block Model (SBM). When the number $K$ of communities remains smaller than $\sqrt{n}$ --where $n$ denotes the number of nodes--, non-trivial community recovery is possible in polynomial time above, and only above, the Kesten--Stigum (KS) threshold, originally postulated using arguments from statistical physics. When $K \geq \sqrt{n}$, Chin, Mossel, Sohn, and Wein recently proved that, in the \emph{sparse regime}, community recovery in polynomial time is achievable below the KS threshold by counting non-backtracking paths. This finding led them to postulate a new threshold for the many-communities regime $K \geq \sqrt{n}$. Subsequently, Carpentier, Giraud, and Verzelen established the failure of low-degree polynomials below this new threshold across all density regimes, and demonstrated successful recovery above the threshold in certain moderately sparse settings. While these results provide strong evidence that, in the many community setting, the computational barrier lies at the threshold proposed in~Chin et al., the question of achieving recovery above this threshold still remains open in most density regimes. The present work is a follow-up to~Carpentier et al., in which we prove Conjecture~1.4 stated therein by: \\ 1- Constructing a family of motifs satisfying specific structural properties; and\\ 2- Proving that community recovery is possible above the proposed threshold by counting such motifs.\\ Our results complete the picture of the computational barrier for community recovery in the SBM with $K \geq \sqrt{n}$ communities. They also indicate that, in moderately sparse regimes, the optimal algorithms appear to be fundamentally different from spectral methods.
MLSep 11, 2025
Low-degree lower bounds via almost orthonormal basesAlexandra Carpentier, Simone Maria Giancola, Christophe Giraud et al.
Low-degree polynomials have emerged as a powerful paradigm for providing evidence of statistical-computational gaps across a variety of high-dimensional statistical models [Wein25]. For detection problems -- where the goal is to test a planted distribution $\mathbb{P}'$ against a null distribution $\mathbb{P}$ with independent components -- the standard approach is to bound the advantage using an $\mathbb{L}^2(\mathbb{P})$-orthonormal family of polynomials. However, this method breaks down for estimation tasks or more complex testing problems where $\mathbb{P}$ has some planted structures, so that no simple $\mathbb{L}^2(\mathbb{P})$-orthogonal polynomial family is available. To address this challenge, several technical workarounds have been proposed [SW22,SW25], though their implementation can be delicate. In this work, we propose a more direct proof strategy. Focusing on random graph models, we construct a basis of polynomials that is almost orthonormal under $\mathbb{P}$, in precisely those regimes where statistical-computational gaps arise. This almost orthonormal basis not only yields a direct route to establishing low-degree lower bounds, but also allows us to explicitly identify the polynomials that optimize the low-degree criterion. This, in turn, provides insights into the design of optimal polynomial-time algorithms. We illustrate the effectiveness of our approach by recovering known low-degree lower bounds, and establishing new ones for problems such as hidden subcliques, stochastic block models, and seriation models.
MLSep 19, 2025
Model-free algorithms for fast node clustering in SBM type graphs and application to social role inference in animalsBertrand Cloez, Adrien Cotil, Jean-Baptiste Menassol et al.
We propose a novel family of model-free algorithms for node clustering and parameter inference in graphs generated from the Stochastic Block Model (SBM), a fundamental framework in community detection. Drawing inspiration from the Lloyd algorithm for the $k$-means problem, our approach extends to SBMs with general edge weight distributions. We establish the consistency of our estimator under a natural identifiability condition. Through extensive numerical experiments, we benchmark our methods against state-of-the-art techniques, demonstrating significantly faster computation times with the lower order of estimation error. Finally, we validate the practical relevance of our algorithms by applying them to empirical network data from behavioral ecology.
MLSep 19, 2025
Phase Transition for Stochastic Block Model with more than $\sqrt{n}$ CommunitiesAlexandra Carpentier, Christophe Giraud, Nicolas Verzelen
Predictions from statistical physics postulate that recovery of the communities in Stochastic Block Model (SBM) is possible in polynomial time above, and only above, the Kesten-Stigum (KS) threshold. This conjecture has given rise to a rich literature, proving that non-trivial community recovery is indeed possible in SBM above the KS threshold, as long as the number $K$ of communities remains smaller than $\sqrt{n}$, where $n$ is the number of nodes in the observed graph. Failure of low-degree polynomials below the KS threshold was also proven when $K=o(\sqrt{n})$. When $K\geq \sqrt{n}$, Chin et al.(2025) recently prove that, in a sparse regime, community recovery in polynomial time is possible below the KS threshold by counting non-backtracking paths. This breakthrough result lead them to postulate a new threshold for the many communities regime $K\geq \sqrt{n}$. In this work, we provide evidences that confirm their conjecture for $K\geq \sqrt{n}$: 1- We prove that, for any density of the graph, low-degree polynomials fail to recover communities below the threshold postulated by Chin et al.(2025); 2- We prove that community recovery is possible in polynomial time above the postulated threshold, not only in the sparse regime of~Chin et al., but also in some (but not all) moderately sparse regimes by essentially by counting occurrences of cliques or self-avoiding paths of suitable size in the observed graph. In addition, we propose a detailed conjecture regarding the structure of motifs that are optimal in sparsity regimes not covered by cliques or self-avoiding paths counting. In particular, counting self-avoiding paths of length $\log(n)$--which is closely related to spectral algorithms based on the Non-Backtracking operator--is optimal only in the sparse regime. Other motif counts--unrelated to spectral properties--should be considered in denser regimes.
MLMar 14, 2025
Clustering Items through Bandit Feedback: Finding the Right Feature out of ManyMaximilian Graf, Victor Thuot, Nicolas Verzelen
We study the problem of clustering a set of items based on bandit feedback. Each of the $n$ items is characterized by a feature vector, with a possibly large dimension $d$. The items are partitioned into two unknown groups such that items within the same group share the same feature vector. We consider a sequential and adaptive setting in which, at each round, the learner selects one item and one feature, then observes a noisy evaluation of the item's feature. The learner's objective is to recover the correct partition of the items, while keeping the number of observations as small as possible. We provide an algorithm which relies on finding a relevant feature for the clustering task, leveraging the Sequential Halving algorithm. With probability at least $1-δ$, we obtain an accurate recovery of the partition and derive an upper bound on the budget required. Furthermore, we derive an instance-dependent lower bound, which is tight in some relevant cases.
MLJun 17, 2024
Active clustering with bandit feedbackVictor Thuot, Alexandra Carpentier, Christophe Giraud et al.
We investigate the Active Clustering Problem (ACP). A learner interacts with an $N$-armed stochastic bandit with $d$-dimensional subGaussian feedback. There exists a hidden partition of the arms into $K$ groups, such that arms within the same group, share the same mean vector. The learner's task is to uncover this hidden partition with the smallest budget - i.e., the least number of observation - and with a probability of error smaller than a prescribed constant $δ$. In this paper, (i) we derive a non-asymptotic lower bound for the budget, and (ii) we introduce the computationally efficient ACB algorithm, whose budget matches the lower bound in most regimes. We improve on the performance of a uniform sampling strategy. Importantly, contrary to the batch setting, we establish that there is no computation-information gap in the active setting.
STAug 6, 2021
Localization in 1D non-parametric latent space models from pairwise affinitiesChristophe Giraud, Yann Issartel, Nicolas Verzelen
We consider the problem of estimating latent positions in a one-dimensional torus from pairwise affinities. The observed affinity between a pair of items is modeled as a noisy observation of a function $f(x^*_{i},x^*_{j})$ of the latent positions $x^*_{i},x^*_{j}$ of the two items on the torus. The affinity function $f$ is unknown, and it is only assumed to fulfill some shape constraints ensuring that $f(x,y)$ is large when the distance between $x$ and $y$ is small, and vice-versa. This non-parametric modeling offers a good flexibility to fit data. We introduce an estimation procedure that provably localizes all the latent positions with a maximum error of the order of $\sqrt{\log(n)/n}$, with high-probability. This rate is proven to be minimax optimal. A computationally efficient variant of the procedure is also analyzed under some more restrictive assumptions. Our general results can be instantiated to the problem of statistical seriation, leading to new bounds for the maximum error in the ordering.
STJul 19, 2018
Partial recovery bounds for clustering with the relaxed $K$meansChristophe Giraud, Nicolas Verzelen
We investigate the clustering performances of the relaxed $K$means in the setting of sub-Gaussian Mixture Model (sGMM) and Stochastic Block Model (SBM). After identifying the appropriate signal-to-noise ratio (SNR), we prove that the misclassification error decay exponentially fast with respect to this SNR. These partial recovery bounds for the relaxed $K$means improve upon results currently known in the sGMM setting. In the SBM setting, applying the relaxed $K$means SDP allows to handle general connection probabilities whereas other SDPs investigated in the literature are restricted to the assortative case (where within group probabilities are larger than between group probabilities). Again, this partial recovery bound complements the state-of-the-art results. All together, these results put forward the versatility of the relaxed $K$means.
MEAug 8, 2015
Model Assisted Variable Clustering: Minimax-optimal Recovery and AlgorithmsFlorentina Bunea, Christophe Giraud, Xi Luo et al.
Model-based clustering defines population level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of G-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations will all other variables. This can arise, for instance, when groups of variables are noise corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a G-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to G-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular K-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we contrast our methods with another popular clustering method, spectral clustering, specialized to variable clustering, and show that ensuring exact cluster recovery via this method requires clusters to have a higher separation, relative to the minimax threshold. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.
STAug 13, 2013
Community Detection in Sparse Random NetworksEry Arias-Castro, Nicolas Verzelen
We consider the problem of detecting a tight community in a sparse random network. This is formalized as testing for the existence of a dense random subgraph in a random graph. Under the null hypothesis, the graph is a realization of an Erdös-Rényi graph on $N$ vertices and with connection probability $p_0$; under the alternative, there is an unknown subgraph on $n$ vertices where the connection probability is p1 > p0. In Arias-Castro and Verzelen (2012), we focused on the asymptotically dense regime where p0 is large enough that np0>(n/N)^{o(1)}. We consider here the asymptotically sparse regime where p0 is small enough that np0<(n/N)^{c0} for some c0>0. As before, we derive information theoretic lower bounds, and also establish the performance of various tests. Compared to our previous work, the arguments for the lower bounds are based on the same technology, but are substantially more technical in the details; also, the methods we study are different: besides a variant of the scan statistic, we study other statistics such as the size of the largest connected component, the number of triangles, the eigengap of the adjacency matrix, etc. Our detection bounds are sharp, except in the Poisson regime where we were not able to fully characterize the constant arising in the bound.
STFeb 28, 2013
Community Detection in Random NetworksEry Arias-Castro, Nicolas Verzelen
We formalize the problem of detecting a community in a network into testing whether in a given (random) graph there is a subgraph that is unusually dense. We observe an undirected and unweighted graph on N nodes. Under the null hypothesis, the graph is a realization of an Erdös-Rényi graph with probability p0. Under the (composite) alternative, there is a subgraph of n nodes where the probability of connection is p1 > p0. We derive a detection lower bound for detecting such a subgraph in terms of N, n, p0, p1 and exhibit a test that achieves that lower bound. We do this both when p0 is known and unknown. We also consider the problem of testing in polynomial-time. As an aside, we consider the problem of detecting a clique, which is intimately related to the planted clique problem. Our focus in this paper is in the quasi-normal regime where n p0 is either bounded away from zero, or tends to zero slowly.