LG AI ML OTDec 24, 2020

Regret Bound Balancing and Elimination for Model Selection in Bandits and RL

Aldo Pacchiano, Christoph Dann, Claudio Gentile, Peter Bartlett

arXiv:2012.13045v119.155 citations

Originality Incremental advance

AI Analysis

This work provides a more robust model selection framework for practitioners in bandits and RL by removing the strong assumption of knowing the optimal regret, which is a common challenge in real-world applications.

This paper introduces a model selection method for bandit and reinforcement learning algorithms that balances and eliminates base algorithms based on their candidate regret bounds. The approach achieves a total regret bounded by the best valid candidate regret bound multiplied by a small factor, which scales only with the number of base algorithms under certain conditions.

We propose a simple model selection approach for algorithms in stochastic bandit and reinforcement learning problems. As opposed to prior work that (implicitly) assumes knowledge of the optimal regret, we only require that each base algorithm comes with a candidate regret bound that may or may not hold during all rounds. In each round, our approach plays a base algorithm to keep the candidate regret bounds of all remaining base algorithms balanced, and eliminates algorithms that violate their candidate bound. We prove that the total regret of this approach is bounded by the best valid candidate regret bound times a multiplicative factor. This factor is reasonably small in several applications, including linear bandits and MDPs with nested function classes, linear bandits with unknown misspecification, and LinUCB applied to linear bandits with different confidence parameters. We further show that, under a suitable gap-assumption, this factor only scales with the number of base algorithms and not their complexity when the number of rounds is large enough. Finally, unlike recent efforts in model selection for linear stochastic bandits, our approach is versatile enough to also cover cases where the context information is generated by an adversarial environment, rather than a stochastic one.

View on arXiv PDF

Similar