lil' UCB : An Optimal Exploration Algorithm for Multi-Armed Bandits
This provides an optimal exploration algorithm for multi-armed bandits, addressing a core challenge in sequential decision-making with applications in areas like recommendation systems and clinical trials.
The paper tackles the problem of identifying the best arm in multi-armed bandits with fixed confidence, proposing a novel UCB algorithm that reduces sample complexity. The result shows optimality up to constants and superior performance in simulations compared to state-of-the-art methods.
The paper proposes a novel upper confidence bound (UCB) procedure for identifying the arm with the largest mean in a multi-armed bandit game in the fixed confidence setting using a small number of total samples. The procedure cannot be improved in the sense that the number of samples required to identify the best arm is within a constant factor of a lower bound based on the law of the iterated logarithm (LIL). Inspired by the LIL, we construct our confidence bounds to explicitly account for the infinite time horizon of the algorithm. In addition, by using a novel stopping time for the algorithm we avoid a union bound over the arms that has been observed in other UCB-type algorithms. We prove that the algorithm is optimal up to constants and also show through simulations that it provides superior performance with respect to the state-of-the-art.