Inference with the Upper Confidence Bound Algorithm
This work addresses reliable statistical inference in sequential decision-making, a problem of interest to researchers and practitioners in machine learning and statistics. It is incremental in that it builds on stability concepts from prior work.
The paper studies inference when data are collected sequentially by the Upper Confidence Bound (UCB) algorithm in multi-armed bandit problems. It shows that UCB satisfies a stability property, which leads to asymptotic normality of the sample mean of each arm. It then extends this analysis to the case where the number of arms K grows with the number of arm pulls T, under conditions such as log K / log T → 0.
In this paper, we study the asymptotic behavior of the Upper Confidence Bound (UCB) algorithm in multi-armed bandit problems and discuss its implications for downstream inferential tasks. While inference becomes challenging when data are collected sequentially, we argue that this problem can be alleviated when the sequential algorithm at hand satisfies a certain stability property. This notion of stability is motivated by the seminal work of Lai and Wei (1982). Our first main result shows that such a stability property is always satisfied for the UCB algorithm, and as a result the sample means for each arm are asymptotically normal. Next, we examine the stability properties of the UCB algorithm when the number of arms $K$ is allowed to grow with the number of arm pulls $T$. We show that in such a case the arms are stable when $\frac{\log K}{\log T} \rightarrow 0$ and the number of near-optimal arms is large.
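To make the setting concrete, the following is a minimal simulation sketch, not the paper's own code. It assumes Gaussian rewards with unit variance and an exploration bonus of the form $\sqrt{2 \log T / n_a}$, where $n_a$ is the number of pulls of arm $a$; the abstract specifies neither. The function names `run_ucb` and `standardized_errors` are hypothetical. The second function collects $\sqrt{n_a}\,(\bar{X}_a - \mu_a)$ over independent runs, the quantity whose distribution one would compare with a standard normal.

```python
import math
import random


def run_ucb(means, horizon, seed=0):
    """Run a UCB rule on Gaussian arms (unit variance, an assumption).

    Returns the per-arm pull counts and per-arm sample means.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            # Pull every arm once so that each index is well defined.
            arm = t
        else:
            # Index = sample mean + exploration bonus (assumed bonus form).
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(horizon) / counts[a]),
            )
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    sample_means = [sums[a] / counts[a] for a in range(k)]
    return counts, sample_means


def standardized_errors(means, horizon, arm, replications):
    """Collect sqrt(n_a) * (sample mean - true mean) over independent runs."""
    errors = []
    for seed in range(replications):
        counts, sample_means = run_ucb(means, horizon, seed)
        errors.append(math.sqrt(counts[arm]) * (sample_means[arm] - means[arm]))
    return errors


if __name__ == "__main__":
    errs = standardized_errors([0.0, 1.0], horizon=2000, arm=1, replications=100)
    mean_err = sum(errs) / len(errs)
    var_err = sum((e - mean_err) ** 2 for e in errs) / (len(errs) - 1)
    # Under asymptotic normality these should be roughly 0 and 1 for large T.
    print(f"empirical mean: {mean_err:.3f}, empirical variance: {var_err:.3f}")
```

A histogram or normal quantile plot of the returned errors is a more informative check than the two summary numbers printed here. A finite simulation of this kind can only illustrate the asymptotic claim, not verify it.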