Variance-Dependent Best Arm Identification
This work addresses the variance-dependent best arm identification problem for stochastic bandits, offering a significant improvement over prior variance-independent algorithms in favorable scenarios.
The paper tackles the problem of identifying the best arm in a stochastic multi-armed bandit by proposing an adaptive algorithm that uses variance-dependent sampling, achieving a sample complexity of O(∑(σ_i²/Δ_i² + 1/Δ_i)(ln δ⁻¹ + ln ln Δ_i⁻¹)) and removing an extra ln n factor compared to state-of-the-art methods, with a matching lower bound proving optimality up to doubly logarithmic terms.
We study the problem of identifying the best arm in a stochastic multi-armed bandit game. Given a set of $n$ arms indexed from $1$ to $n$, each arm $i$ is associated with an unknown reward distribution supported on $[0,1]$ with mean $θ_i$ and variance $σ_i^2$. Assume $θ_1 > θ_2 \geq \cdots \geqθ_n$. We propose an adaptive algorithm which explores the gaps and variances of the rewards of the arms and makes future decisions based on the gathered information using a novel approach called \textit{grouped median elimination}. The proposed algorithm guarantees to output the best arm with probability $(1-δ)$ and uses at most $O \left(\sum_{i = 1}^n \left(\frac{σ_i^2}{Δ_i^2} + \frac{1}{Δ_i}\right)(\ln δ^{-1} + \ln \ln Δ_i^{-1})\right)$ samples, where $Δ_i$ ($i \geq 2$) denotes the reward gap between arm $i$ and the best arm and we define $Δ_1 = Δ_2$. This achieves a significant advantage over the variance-independent algorithms in some favorable scenarios and is the first result that removes the extra $\ln n$ factor on the best arm compared with the state-of-the-art. We further show that $Ω\left( \sum_{i = 1}^n \left( \frac{σ_i^2}{Δ_i^2} + \frac{1}{Δ_i} \right) \ln δ^{-1} \right)$ samples are necessary for an algorithm to achieve the same goal, thereby illustrating that our algorithm is optimal up to doubly logarithmic terms.