Nearly Tight Bounds for Exploration in Streaming Multi-armed Bandits with Known Optimality Gap
This work addresses a key open problem in streaming bandit algorithms for scenarios with known gap information, providing nearly tight bounds that are incremental to prior results.
The paper tackles the problem of determining the pass complexity for pure exploration in multi-pass streaming multi-armed bandits with known optimality gap, showing that Θ(log n) passes are necessary and sufficient to achieve O(∑_{i=2}^n 1/Δ_{[i]}^2 * log n) arm pulls with sublinear memory.
We investigate the sample-memory-pass trade-offs for pure exploration in multi-pass streaming multi-armed bandits (MABs) with the *a priori* knowledge of the optimality gap $Δ_{[2]}$. Here, and throughout, the optimality gap $Δ_{[i]}$ is defined as the mean reward gap between the best and the $i$-th best arms. A recent line of results by Jin, Huang, Tang, and Xiao [ICML'21] and Assadi and Wang [COLT'24] have shown that if there is no known $Δ_{[2]}$, a pass complexity of $Θ(\log(1/Δ_{[2]}))$ (up to $\log\log(1/Δ_{[2]})$ terms) is necessary and sufficient to obtain the *worst-case optimal* sample complexity of $O(n/Δ^{2}_{[2]})$ with a single-arm memory. However, our understanding of multi-pass algorithms with known $Δ_{[2]}$ is still limited. Here, the key open problem is how many passes are required to achieve the complexity, i.e., $O( \sum_{i=2}^{n}1/Δ^2_{[i]})$ arm pulls, with a sublinear memory size. In this work, we show that the ``right answer'' for the question is $Θ(\log{n})$ passes (up to $\log\log{n}$ terms). We first present a lower bound, showing that any algorithm that finds the best arm with slightly sublinear memory -- a memory of $o({n}/{\text{polylog}({n})})$ arms -- and $O(\sum_{i=2}^{n}{1}/{Δ^{2}_{[i]}}\cdot \log{(n)})$ arm pulls has to make $Ω(\frac{\log{n}}{\log\log{n}})$ passes over the stream. We then show a nearly-matching algorithm that assuming the knowledge of $Δ_{[2]}$, finds the best arm with $O( \sum_{i=2}^{n}1/Δ^2_{[i]} \cdot \log{n})$ arm pulls and a *single arm* memory.