Mingda Qiao

LG
h-index25
24papers
459citations
Novelty58%
AI Score57

24 Papers

LGOct 5, 2022
A Fourier Approach to Mixture Learning

Mingda Qiao, Guru Guruganesh, Ankit Singh Rawat et al.

We revisit the problem of learning mixtures of spherical Gaussians. Given samples from mixture $\frac{1}{k}\sum_{j=1}^{k}\mathcal{N}(μ_j, I_d)$, the goal is to estimate the means $μ_1, μ_2, \ldots, μ_k \in \mathbb{R}^d$ up to a small error. The hardness of this learning problem can be measured by the separation $Δ$ defined as the minimum distance between all pairs of means. Regev and Vijayaraghavan (2017) showed that with $Δ= Ω(\sqrt{\log k})$ separation, the means can be learned using $\mathrm{poly}(k, d)$ samples, whereas super-polynomially many samples are required if $Δ= o(\sqrt{\log k})$ and $d = Ω(\log k)$. This leaves open the low-dimensional regime where $d = o(\log k)$. In this work, we give an algorithm that efficiently learns the means in $d = O(\log k/\log\log k)$ dimensions under separation $d/\sqrt{\log k}$ (modulo doubly logarithmic factors). This separation is strictly smaller than $\sqrt{\log k}$, and is also shown to be necessary. Along with the results of Regev and Vijayaraghavan (2017), our work almost pins down the critical separation threshold at which efficient parameter learning becomes possible for spherical Gaussian mixtures. More generally, our algorithm runs in time $\mathrm{poly}(k)\cdot f(d, Δ, ε)$, and is thus fixed-parameter tractable in parameters $d$, $Δ$ and $ε$. Our approach is based on estimating the Fourier transform of the mixture at carefully chosen frequencies, and both the algorithm and its analysis are simple and elementary. Our positive results can be easily extended to learning mixtures of non-Gaussian distributions, under a mild condition on the Fourier spectrum of the distribution.

LGJul 19, 2024
Truthfulness of Calibration Measures

Nika Haghtalab, Mingda Qiao, Kunhe Yang et al.

We initiate the study of the truthfulness of calibration measures in sequential prediction. A calibration measure is said to be truthful if the forecaster (approximately) minimizes the expected penalty by predicting the conditional expectation of the next outcome, given the prior distribution of outcomes. Truthfulness is an important property of calibration measures, ensuring that the forecaster is not incentivized to exploit the system with deliberate poor forecasts. This makes it an essential desideratum for calibration measures, alongside typical requirements, such as soundness and completeness. We conduct a taxonomy of existing calibration measures and their truthfulness. Perhaps surprisingly, we find that all of them are far from being truthful. That is, under existing calibration measures, there are simple distributions on which a polylogarithmic (or even zero) penalty is achievable, while truthful prediction leads to a polynomial penalty. Our main contribution is the introduction of a new calibration measure termed the Subsampled Smooth Calibration Error (SSCE) under which truthful prediction is optimal up to a constant multiplicative factor.

DSJun 29, 2022
Open Problem: Properly learning decision trees in polynomial time?

Guy Blanc, Jane Lange, Mingda Qiao et al.

The authors recently gave an $n^{O(\log\log n)}$ time membership query algorithm for properly learning decision trees under the uniform distribution (Blanc et al., 2021). The previous fastest algorithm for this problem ran in $n^{O(\log n)}$ time, a consequence of Ehrenfeucht and Haussler (1989)'s classic algorithm for the distribution-free setting. In this article we highlight the natural open problem of obtaining a polynomial-time algorithm, discuss possible avenues towards obtaining it, and state intermediate milestones that we believe are of independent interest.

DSNov 28, 2023
A Combinatorial Approach to Robust PCA

Weihao Kong, Mingda Qiao, Rajat Sen

We study the problem of recovering Gaussian data under adversarial corruptions when the noises are low-rank and the corruptions are on the coordinate level. Concretely, we assume that the Gaussian noises lie in an unknown $k$-dimensional subspace $U \subseteq \mathbb{R}^d$, and $s$ randomly chosen coordinates of each data point fall into the control of an adversary. This setting models the scenario of learning from high-dimensional yet structured data that are transmitted through a highly-noisy channel, so that the data points are unlikely to be entirely clean. Our main result is an efficient algorithm that, when $ks^2 = O(d)$, recovers every single data point up to a nearly-optimal $\ell_1$ error of $\tilde O(ks/d)$ in expectation. At the core of our proof is a new analysis of the well-known Basis Pursuit (BP) method for recovering a sparse signal, which is known to succeed under additional assumptions (e.g., incoherence or the restricted isometry property) on the underlying subspace $U$. In contrast, we present a novel approach via studying a natural combinatorial problem and show that, over the randomness in the support of the sparse signal, a high-probability error bound is possible even if the subspace $U$ is arbitrary.

DSMay 7
Optimal $k$-Secretary with Logarithmic Memory

Mingda Qiao, Wei Zhang

We study memory-bounded algorithms for the $k$-secretary problem. The algorithm of Kleinberg (SODA 2005) achieves an optimal competitive ratio of $1 - O(1/\sqrt{k})$, yet a straightforward implementation requires $Ω(k)$ memory. Our main result is a $k$-secretary algorithm that matches the optimal competitive ratio using $O(\log k)$ words of memory. We prove this result by establishing a general reduction from $k$-secretary to (random-order) quantile estimation, the problem of finding the $k$-th largest element in a stream. We show that a quantile estimation algorithm with an $O(k^α)$ expected error (in terms of the rank) gives a $(1 - O(1/k^{1-α}))$-competitive $k$-secretary algorithm with $O(1)$ extra words. We then introduce a new quantile estimation algorithm that achieves an $O(\sqrt{k})$ expected error bound using $O(\log k)$ memory. Of independent interest, we give a different algorithm that uses $O(\sqrt{k})$ words and finds the $k$-th largest element exactly with high probability, generalizing a result of Munro and Paterson (1980).

LGFeb 12, 2024
On the Distance from Calibration in Sequential Prediction

Mingda Qiao, Letian Zheng

We study a sequential binary prediction setting where the forecaster is evaluated in terms of the calibration distance, which is defined as the $L_1$ distance between the predicted values and the set of predictions that are perfectly calibrated in hindsight. This is analogous to a calibration measure recently proposed by Błasiok, Gopalan, Hu and Nakkiran (STOC 2023) for the offline setting. The calibration distance is a natural and intuitive measure of deviation from perfect calibration, and satisfies a Lipschitz continuity property which does not hold for many popular calibration measures, such as the $L_1$ calibration error and its variants. We prove that there is a forecasting algorithm that achieves an $O(\sqrt{T})$ calibration distance in expectation on an adversarially chosen sequence of $T$ binary outcomes. At the core of this upper bound is a structural result showing that the calibration distance is accurately approximated by the lower calibration distance, which is a continuous relaxation of the former. We then show that an $O(\sqrt{T})$ lower calibration distance can be achieved via a simple minimax argument and a reduction to online learning on a Lipschitz class. On the lower bound side, an $Ω(T^{1/3})$ calibration distance is shown to be unavoidable, even when the adversary outputs a sequence of independent random bits, and has an additional ability to early stop (i.e., to stop producing random bits and output the same bit in the remaining steps). Interestingly, without this early stopping, the forecaster can achieve a much smaller calibration distance of $\mathrm{polylog}(T)$.

DSMar 19
Computational and Statistical Hardness of Calibration Distance

Mingda Qiao

The distance from calibration, introduced by Błasiok, Gopalan, Hu, and Nakkiran (STOC 2023), has recently emerged as a central measure of miscalibration for probabilistic predictors. We study the fundamental problems of computing and estimating this quantity, given either an exact description of the data distribution or only sample access to it. We give an efficient algorithm that exactly computes the calibration distance when the distribution has a uniform marginal and noiseless labels, which improves the $O(1/\sqrt{|\mathcal{X}|})$ additive approximation of Qiao and Zheng (COLT 2024) for this special case. Perhaps surprisingly, the problem becomes $\mathsf{NP}$-hard when either of the two assumptions is removed. We extend our algorithm to a polynomial-time approximation scheme for the general case. For the estimation problem, we show that $Θ(1/ε^3)$ samples are sufficient and necessary for the empirical calibration distance to be upper bounded by the true distance plus $ε$. In contrast, a polynomial dependence on the domain size -- incurred by the learning-based baseline -- is unavoidable for two-sided estimation. Our positive results are based on simple sparsifications of both the distribution and the target predictor, which significantly reduce the search space for computation and lead to stronger concentration for the estimation problem. To prove the hardness results, we introduce new techniques for certifying lower bounds on the calibration distance -- a problem that is hard in general due to its $\textsf{co-NP}$-completeness.

LGDec 1, 2025
Limitations of Membership Queries in Testable Learning

Jane Lange, Mingda Qiao

Membership queries (MQ) often yield speedups for learning tasks, particularly in the distribution-specific setting. We show that in the \emph{testable learning} model of Rubinfeld and Vasilyan [RV23], membership queries cannot decrease the time complexity of testable learning algorithms beyond the complexity of sample-only distribution-specific learning. In the testable learning model, the learner must output a hypothesis whenever the data distribution satisfies a desired property, and if it outputs a hypothesis, the hypothesis must be near-optimal. We give a general reduction from sample-based \emph{refutation} of boolean concept classes, as presented in [Vadhan17, KL18], to testable learning with queries (TL-Q). This yields lower bounds for TL-Q via the reduction from learning to refutation given in [KL18]. The result is that, relative to a concept class and a distribution family, no $m$-sample TL-Q algorithm can be super-polynomially more time-efficient than the best $m$-sample PAC learner. Finally, we define a class of ``statistical'' MQ algorithms that encompasses many known distribution-specific MQ learners, such as those based on influence estimation or subcube-conditional statistical queries. We show that TL-Q algorithms in this class imply efficient statistical-query refutation and learning algorithms. Thus, combined with known SQ dimension lower bounds, our results imply that these efficient membership query learners cannot be made testable.

LGMar 4, 2025
Truthfulness of Decision-Theoretic Calibration Measures

Mingda Qiao, Eric Zhao

Calibration measures quantify how much a forecaster's predictions violates calibration, which requires that forecasts are unbiased conditioning on the forecasted probabilities. Two important desiderata for a calibration measure are its decision-theoretic implications (i.e., downstream decision-makers that best-respond to the forecasts are always no-regret) and its truthfulness (i.e., a forecaster approximately minimizes error by always reporting the true probabilities). Existing measures satisfy at most one of the properties, but not both. We introduce a new calibration measure termed subsampled step calibration, $\mathsf{StepCE}^{\textsf{sub}}$, that is both decision-theoretic and truthful. In particular, on any product distribution, $\mathsf{StepCE}^{\textsf{sub}}$ is truthful up to an $O(1)$ factor whereas prior decision-theoretic calibration measures suffer from an $e^{-Ω(T)}$-$Ω(\sqrt{T})$ truthfulness gap. Moreover, in any smoothed setting where the conditional probability of each event is perturbed by a noise of magnitude $c > 0$, $\mathsf{StepCE}^{\textsf{sub}}$ is truthful up to an $O(\sqrt{\log(1/c)})$ factor, while prior decision-theoretic measures have an $e^{-Ω(T)}$-$Ω(T^{1/3})$ truthfulness gap. We also prove a general impossibility result for truthful decision-theoretic forecasting: any complete and decision-theoretic calibration measure must be discontinuous and non-truthful in the non-smoothed setting.

LGFeb 16, 2024
Collaborative Learning with Different Labeling Functions

Yuyang Deng, Mingda Qiao

We study a variant of Collaborative PAC Learning, in which we aim to learn an accurate classifier for each of the $n$ data distributions, while minimizing the number of samples drawn from them in total. Unlike in the usual collaborative learning setup, it is not assumed that there exists a single classifier that is simultaneously accurate for all distributions. We show that, when the data distributions satisfy a weaker realizability assumption, which appeared in [Crammer and Mansour, 2012] in the context of multi-task learning, sample-efficient learning is still feasible. We give a learning algorithm based on Empirical Risk Minimization (ERM) on a natural augmentation of the hypothesis class, and the analysis relies on an upper bound on the VC dimension of this augmented class. In terms of the computational efficiency, we show that ERM on the augmented hypothesis class is NP-hard, which gives evidence against the existence of computationally efficient learners in general. On the positive side, for two special cases, we give learners that are both sample- and computationally-efficient.

LGNov 19, 2025
Sample-Adaptivity Tradeoff in On-Demand Sampling

Nika Haghtalab, Omar Montasser, Mingda Qiao

We study the tradeoff between sample complexity and round complexity in on-demand sampling, where the learning algorithm adaptively samples from $k$ distributions over a limited number of rounds. In the realizable setting of Multi-Distribution Learning (MDL), we show that the optimal sample complexity of an $r$-round algorithm scales approximately as $dk^{Θ(1/r)} / ε$. For the general agnostic case, we present an algorithm that achieves near-optimal sample complexity of $\widetilde O((d + k) / ε^2)$ within $\widetilde O(\sqrt{k})$ rounds. Of independent interest, we introduce a new framework, Optimization via On-Demand Sampling (OODS), which abstracts the sample-adaptivity tradeoff and captures most existing MDL algorithms. We establish nearly tight bounds on the round complexity in the OODS setting. The upper bounds directly yield the $\widetilde O(\sqrt{k})$-round algorithm for agnostic MDL, while the lower bounds imply that achieving sub-polynomial round complexity would require fundamentally new techniques that bypass the inherent hardness of OODS.

LGAug 13, 2025
Online Prediction with Limited Selectivity

Licheng Liu, Mingda Qiao

Selective prediction [Dru13, QV19] models the scenario where a forecaster freely decides on the prediction window that their forecast spans. Many data statistics can be predicted to a non-trivial error rate without any distributional assumptions or expert advice, yet these results rely on that the forecaster may predict at any time. We introduce a model of Prediction with Limited Selectivity (PLS) where the forecaster can start the prediction only on a subset of the time horizon. We study the optimal prediction error both on an instance-by-instance basis and via an average-case analysis. We introduce a complexity measure that gives instance-dependent bounds on the optimal error. For a randomly-generated PLS instance, these bounds match with high probability.

DSSep 1, 2021
Properly learning decision trees in almost polynomial time

Guy Blanc, Jane Lange, Mingda Qiao et al.

We give an $n^{O(\log\log n)}$-time membership query algorithm for properly and agnostically learning decision trees under the uniform distribution over $\{\pm 1\}^n$. Even in the realizable setting, the previous fastest runtime was $n^{O(\log n)}$, a consequence of a classic algorithm of Ehrenfeucht and Haussler. Our algorithm shares similarities with practical heuristics for learning decision trees, which we augment with additional ideas to circumvent known lower bounds against these heuristics. To analyze our algorithm, we prove a new structural result for decision trees that strengthens a theorem of O'Donnell, Saks, Schramm, and Servedio. While the OSSS theorem says that every decision tree has an influential variable, we show how every decision tree can be "pruned" so that every variable in the resulting tree is influential.

LGJul 2, 2021
Decision tree heuristics can fail, even in the smoothed setting

Guy Blanc, Jane Lange, Mingda Qiao et al.

Greedy decision tree learning heuristics are mainstays of machine learning practice, but theoretical justification for their empirical success remains elusive. In fact, it has long been known that there are simple target functions for which they fail badly (Kearns and Mansour, STOC 1996). Recent work of Brutzkus, Daniely, and Malach (COLT 2020) considered the smoothed analysis model as a possible avenue towards resolving this disconnect. Within the smoothed setting and for targets $f$ that are $k$-juntas, they showed that these heuristics successfully learn $f$ with depth-$k$ decision tree hypotheses. They conjectured that the same guarantee holds more generally for targets that are depth-$k$ decision trees. We provide a counterexample to this conjecture: we construct targets that are depth-$k$ decision trees and show that even in the smoothed setting, these heuristics build trees of depth $2^{Ω(k)}$ before achieving high accuracy. We also show that the guarantees of Brutzkus et al. cannot extend to the agnostic setting: there are targets that are very close to $k$-juntas, for which these heuristics build trees of depth $2^{Ω(k)}$ before achieving high accuracy.

LGJun 29, 2021
Exponential Weights Algorithms for Selective Learning

Mingda Qiao, Gregory Valiant

We study the selective learning problem introduced by Qiao and Valiant (2019), in which the learner observes $n$ labeled data points one at a time. At a time of its choosing, the learner selects a window length $w$ and a model $\hat\ell$ from the model class $\mathcal{L}$, and then labels the next $w$ data points using $\hat\ell$. The excess risk incurred by the learner is defined as the difference between the average loss of $\hat\ell$ over those $w$ data points and the smallest possible average loss among all models in $\mathcal{L}$ over those $w$ data points. We give an improved algorithm, termed the hybrid exponential weights algorithm, that achieves an expected excess risk of $O((\log\log|\mathcal{L}| + \log\log n)/\log n)$. This result gives a doubly exponential improvement in the dependence on $|\mathcal{L}|$ over the best known bound of $O(\sqrt{|\mathcal{L}|/\log n})$. We complement the positive result with an almost matching lower bound, which suggests the worst-case optimality of the algorithm. We also study a more restrictive family of learning algorithms that are bounded-recall in the sense that when a prediction window of length $w$ is chosen, the learner's decision only depends on the most recent $w$ data points. We analyze an exponential weights variant of the ERM algorithm in Qiao and Valiant (2019). This new algorithm achieves an expected excess risk of $O(\sqrt{\log |\mathcal{L}|/\log n})$, which is shown to be nearly optimal among all bounded-recall learners. Our analysis builds on a generalized version of the selective mean prediction problem in Drucker (2013); Qiao and Valiant (2019), which may be of independent interest.

LGDec 7, 2020
Stronger Calibration Lower Bounds via Sidestepping

Mingda Qiao, Gregory Valiant

We consider an online binary prediction setting where a forecaster observes a sequence of $T$ bits one by one. Before each bit is revealed, the forecaster predicts the probability that the bit is $1$. The forecaster is called well-calibrated if for each $p \in [0, 1]$, among the $n_p$ bits for which the forecaster predicts probability $p$, the actual number of ones, $m_p$, is indeed equal to $p \cdot n_p$. The calibration error, defined as $\sum_p |m_p - p n_p|$, quantifies the extent to which the forecaster deviates from being well-calibrated. It has long been known that an $O(T^{2/3})$ calibration error is achievable even when the bits are chosen adversarially, and possibly based on the previous predictions. However, little is known on the lower bound side, except an $Ω(\sqrt{T})$ bound that follows from the trivial example of independent fair coin flips. In this paper, we prove an $Ω(T^{0.528})$ bound on the calibration error, which is the first super-$\sqrt{T}$ lower bound for this setting to the best of our knowledge. The technical contributions of our work include two lower bound techniques, early stopping and sidestepping, which circumvent the obstacles that have previously hindered strong calibration lower bounds. We also propose an abstraction of the prediction setting, termed the Sign-Preservation game, which may be of independent interest. This game has a much smaller state space than the full prediction setting and allows simpler analyses. The $Ω(T^{0.528})$ lower bound follows from a general reduction theorem that translates lower bounds on the game value of Sign-Preservation into lower bounds on the calibration error.

LGFeb 12, 2019
A Theory of Selective Prediction

Mingda Qiao, Gregory Valiant

We consider a model of selective prediction, where the prediction algorithm is given a data sequence in an online fashion and asked to predict a pre-specified statistic of the upcoming data points. The algorithm is allowed to choose when to make the prediction as well as the length of the prediction window, possibly depending on the observations so far. We prove that, even without any distributional assumption on the input data stream, a large family of statistics can be estimated to non-trivial accuracy. To give one concrete example, suppose that we are given access to an arbitrary binary sequence $x_1, \ldots, x_n$ of length $n$. Our goal is to accurately predict the average observation, and we are allowed to choose the window over which the prediction is made: for some $t < n$ and $m \le n - t$, after seeing $t$ observations we predict the average of $x_{t+1}, \ldots, x_{t+m}$. This particular problem was first studied in Drucker (2013) and referred to as the "density prediction game". We show that the expected squared error of our prediction can be bounded by $O(\frac{1}{\log n})$ and prove a matching lower bound, which resolves an open question raised in Drucker (2013). This result holds for any sequence (that is not adaptive to when the prediction is made, or the predicted value), and the expectation of the error is with respect to the randomness of the prediction algorithm. Our results apply to more general statistics of a sequence of observations, and we highlight several open directions for future work.

LGFeb 2, 2019
On Generalization Error Bounds of Noisy Gradient Methods for Non-Convex Learning

Jian Li, Xuanyuan Luo, Mingda Qiao

Generalization error (also known as the out-of-sample error) measures how well the hypothesis learned from training data generalizes to previously unseen data. Proving tight generalization error bounds is a central question in statistical learning theory. In this paper, we obtain generalization error bounds for learning general non-convex objectives, which has attracted significant attention in recent years. We develop a new framework, termed Bayes-Stability, for proving algorithm-dependent generalization error bounds. The new framework combines ideas from both the PAC-Bayesian theory and the notion of algorithmic stability. Applying the Bayes-Stability method, we obtain new data-dependent generalization bounds for stochastic gradient Langevin dynamics (SGLD) and several other noisy gradient methods (e.g., with momentum, mini-batch and acceleration, Entropy-SGD). Our result recovers (and is typically tighter than) a recent result in Mou et al. (2018) and improves upon the results in Pensia et al. (2018). Our experiments demonstrate that our data-dependent bounds can distinguish randomly labelled data from normal data, which provides an explanation to the intriguing phenomena observed in Zhang et al. (2017a). We also study the setting where the total loss is the sum of a bounded loss and an additional \ell_2 regularization term. We obtain new generalization bounds for the continuous Langevin dynamic in this setting by developing a new Log-Sobolev inequality for the parameter distribution at any time. Our new bounds are more desirable when the noisy level of the process is not small, and do not become vacuous even when T tends to infinity.

LGMay 12, 2018
Do Outliers Ruin Collaboration?

Mingda Qiao

We consider the problem of learning a binary classifier from $n$ different data sources, among which at most an $η$ fraction are adversarial. The overhead is defined as the ratio between the sample complexity of learning in this setting and that of learning the same hypothesis class on a single data distribution. We present an algorithm that achieves an $O(ηn + \ln n)$ overhead, which is proved to be worst-case optimal. We also discuss the potential challenges to the design of a computationally efficient learning algorithm with a small overhead.

LGNov 22, 2017
Learning Discrete Distributions from Untrusted Batches

Mingda Qiao, Gregory Valiant

We consider the problem of learning a discrete distribution in the presence of an $ε$ fraction of malicious data sources. Specifically, we consider the setting where there is some underlying distribution, $p$, and each data source provides a batch of $\ge k$ samples, with the guarantee that at least a $(1-ε)$ fraction of the sources draw their samples from a distribution with total variation distance at most $η$ from $p$. We make no assumptions on the data provided by the remaining $ε$ fraction of sources--this data can even be chosen as an adversarial function of the $(1-ε)$ fraction of "good" batches. We provide two algorithms: one with runtime exponential in the support size, $n$, but polynomial in $k$, $1/ε$ and $1/η$ that takes $O((n+k)/ε^2)$ batches and recovers $p$ to error $O(η+ε/\sqrt{k})$. This recovery accuracy is information theoretically optimal, to constant factors, even given an infinite number of data sources. Our second algorithm applies to the $η= 0$ setting and also achieves an $O(ε/\sqrt{k})$ recover guarantee, though it runs in $\mathrm{poly}((nk)^k)$ time. This second algorithm, which approximates a certain tensor via a rank-1 tensor minimizing $\ell_1$ distance, is surprising in light of the hardness of many low-rank tensor approximation problems, and may be of independent interest.

LGJun 4, 2017
Nearly Optimal Sampling Algorithms for Combinatorial Pure Exploration

Lijie Chen, Anupam Gupta, Jian Li et al.

We study the combinatorial pure exploration problem Best-Set in stochastic multi-armed bandits. In a Best-Set instance, we are given $n$ arms with unknown reward distributions, as well as a family $\mathcal{F}$ of feasible subsets over the arms. Our goal is to identify the feasible subset in $\mathcal{F}$ with the maximum total mean using as few samples as possible. The problem generalizes the classical best arm identification problem and the top-$k$ arm identification problem, both of which have attracted significant attention in recent years. We provide a novel instance-wise lower bound for the sample complexity of the problem, as well as a nontrivial sampling algorithm, matching the lower bound up to a factor of $\ln|\mathcal{F}|$. For an important class of combinatorial families, we also provide polynomial time implementation of the sampling algorithm, using the equivalence of separation and optimization for convex program, and approximate Pareto curves in multi-objective optimization. We also show that the $\ln|\mathcal{F}|$ factor is inevitable in general through a nontrivial lower bound construction. Our results significantly improve several previous results for several important combinatorial constraints, and provide a tighter understanding of the general Best-Set problem. We further introduce an even more general problem, formulated in geometric terms. We are given $n$ Gaussian arms with unknown means and unit variance. Consider the $n$-dimensional Euclidean space $\mathbb{R}^n$, and a collection $\mathcal{O}$ of disjoint subsets. Our goal is to determine the subset in $\mathcal{O}$ that contains the $n$-dimensional vector of the means. The problem generalizes most pure exploration bandit problems studied in the literature. We provide the first nearly optimal sample complexity upper and lower bounds for the problem.

LGMay 19, 2017
Practical Algorithms for Best-K Identification in Multi-Armed Bandits

Haotian Jiang, Jian Li, Mingda Qiao

In the Best-$K$ identification problem (Best-$K$-Arm), we are given $N$ stochastic bandit arms with unknown reward distributions. Our goal is to identify the $K$ arms with the largest means with high confidence, by drawing samples from the arms adaptively. This problem is motivated by various practical applications and has attracted considerable attention in the past decade. In this paper, we propose new practical algorithms for the Best-$K$-Arm problem, which have nearly optimal sample complexity bounds (matching the lower bound up to logarithmic factors) and outperform the state-of-the-art algorithms for the Best-$K$-Arm problem (even for $K=1$) in practice.

LGFeb 13, 2017
Nearly Instance Optimal Sample Complexity Bounds for Top-k Arm Selection

Lijie Chen, Jian Li, Mingda Qiao

In the Best-$k$-Arm problem, we are given $n$ stochastic bandit arms, each associated with an unknown reward distribution. We are required to identify the $k$ arms with the largest means by taking as few samples as possible. In this paper, we make progress towards a complete characterization of the instance-wise sample complexity bounds for the Best-$k$-Arm problem. On the lower bound side, we obtain a novel complexity term to measure the sample complexity that every Best-$k$-Arm instance requires. This is derived by an interesting and nontrivial reduction from the Best-$1$-Arm problem. We also provide an elimination-based algorithm that matches the instance-wise lower bound within doubly-logarithmic factors. The sample complexity of our algorithm strictly dominates the state-of-the-art for Best-$k$-Arm (module constant factors).

LGAug 22, 2016
Towards Instance Optimal Bounds for Best Arm Identification

Lijie Chen, Jian Li, Mingda Qiao

In the classical best arm identification (Best-$1$-Arm) problem, we are given $n$ stochastic bandit arms, each associated with a reward distribution with an unknown mean. We would like to identify the arm with the largest mean with probability at least $1-δ$, using as few samples as possible. Understanding the sample complexity of Best-$1$-Arm has attracted significant attention since the last decade. However, the exact sample complexity of the problem is still unknown. Recently, Chen and Li made the gap-entropy conjecture concerning the instance sample complexity of Best-$1$-Arm. Given an instance $I$, let $μ_{[i]}$ be the $i$th largest mean and $Δ_{[i]}=μ_{[1]}-μ_{[i]}$ be the corresponding gap. $H(I)=\sum_{i=2}^nΔ_{[i]}^{-2}$ is the complexity of the instance. The gap-entropy conjecture states that $Ω\left(H(I)\cdot\left(\lnδ^{-1}+\mathsf{Ent}(I)\right)\right)$ is an instance lower bound, where $\mathsf{Ent}(I)$ is an entropy-like term determined by the gaps, and there is a $δ$-correct algorithm for Best-$1$-Arm with sample complexity $O\left(H(I)\cdot\left(\lnδ^{-1}+\mathsf{Ent}(I)\right)+Δ_{[2]}^{-2}\ln\lnΔ_{[2]}^{-1}\right)$. If the conjecture is true, we would have a complete understanding of the instance-wise sample complexity of Best-$1$-Arm. We make significant progress towards the resolution of the gap-entropy conjecture. For the upper bound, we provide a highly nontrivial algorithm which requires \[O\left(H(I)\cdot\left(\lnδ^{-1} +\mathsf{Ent}(I)\right)+Δ_{[2]}^{-2}\ln\lnΔ_{[2]}^{-1}\mathrm{polylog}(n,δ^{-1})\right)\] samples in expectation. For the lower bound, we show that for any Gaussian Best-$1$-Arm instance with gaps of the form $2^{-k}$, any $δ$-correct monotone algorithm requires $Ω\left(H(I)\cdot\left(\lnδ^{-1} + \mathsf{Ent}(I)\right)\right)$ samples in expectation.