Towards Instance Optimal Bounds for Best Arm Identification
This work advances theoretical understanding of sample complexity in bandit optimization, which is incremental progress toward a complete solution for researchers in machine learning theory.
The paper tackles the best arm identification problem in stochastic bandits by making progress on the gap-entropy conjecture, providing an algorithm with sample complexity O(H(I)·(lnδ^{-1}+Ent(I))+Δ_{[2]}^{-2}lnlnΔ_{[2]}^{-1} polylog(n,δ^{-1})) and a lower bound of Ω(H(I)·(lnδ^{-1}+Ent(I))) for specific instances.
In the classical best arm identification (Best-$1$-Arm) problem, we are given $n$ stochastic bandit arms, each associated with a reward distribution with an unknown mean. We would like to identify the arm with the largest mean with probability at least $1-δ$, using as few samples as possible. Understanding the sample complexity of Best-$1$-Arm has attracted significant attention since the last decade. However, the exact sample complexity of the problem is still unknown. Recently, Chen and Li made the gap-entropy conjecture concerning the instance sample complexity of Best-$1$-Arm. Given an instance $I$, let $μ_{[i]}$ be the $i$th largest mean and $Δ_{[i]}=μ_{[1]}-μ_{[i]}$ be the corresponding gap. $H(I)=\sum_{i=2}^nΔ_{[i]}^{-2}$ is the complexity of the instance. The gap-entropy conjecture states that $Ω\left(H(I)\cdot\left(\lnδ^{-1}+\mathsf{Ent}(I)\right)\right)$ is an instance lower bound, where $\mathsf{Ent}(I)$ is an entropy-like term determined by the gaps, and there is a $δ$-correct algorithm for Best-$1$-Arm with sample complexity $O\left(H(I)\cdot\left(\lnδ^{-1}+\mathsf{Ent}(I)\right)+Δ_{[2]}^{-2}\ln\lnΔ_{[2]}^{-1}\right)$. If the conjecture is true, we would have a complete understanding of the instance-wise sample complexity of Best-$1$-Arm. We make significant progress towards the resolution of the gap-entropy conjecture. For the upper bound, we provide a highly nontrivial algorithm which requires \[O\left(H(I)\cdot\left(\lnδ^{-1} +\mathsf{Ent}(I)\right)+Δ_{[2]}^{-2}\ln\lnΔ_{[2]}^{-1}\mathrm{polylog}(n,δ^{-1})\right)\] samples in expectation. For the lower bound, we show that for any Gaussian Best-$1$-Arm instance with gaps of the form $2^{-k}$, any $δ$-correct monotone algorithm requires $Ω\left(H(I)\cdot\left(\lnδ^{-1} + \mathsf{Ent}(I)\right)\right)$ samples in expectation.