The Pareto Frontier of model selection for general Contextual Bandits
This work resolves open problems in bandit theory and clarifies the trade-offs inherent in model selection, though its advance in theoretical understanding is incremental.
The paper addresses the fundamental limits of model selection for general contextual bandits with nested policy classes. It shows that optimal guarantees cannot be achieved simultaneously across all policies in the nested sequence, and it establishes a Pareto frontier on which an increase in the complexity term (the logarithm of the policy class size) is unavoidable.
Recent progress in model selection raises the question of the fundamental limits of these techniques. Model selection for general contextual bandits with nested policy classes has come under particular scrutiny, resulting in a COLT 2020 open problem. It asks whether it is possible to obtain the optimal single-algorithm guarantees simultaneously over all policies in a nested sequence of policy classes or, failing that, whether this is possible with a trade-off $α\in[\frac{1}{2},1)$ between complexity term and time: $\ln(|Π_m|)^{1-α}T^α$. We give a disappointing answer to this question: even in the purely stochastic regime, the desired results are unobtainable. We present a Pareto frontier of upper and lower bounds that match up to logarithmic factors, thereby proving that an increase in the complexity term $\ln(|Π_m|)$ independent of $T$ is unavoidable for general policy classes. As a side result, we also resolve a COLT 2016 open problem concerning second-order bounds in full-information games.
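To make the trade-off in the open problem concrete, the target bound can be written out and evaluated at its endpoint. This is only a sketch built from the expression in the abstract. The symbol $\mathrm{Reg}_T(m)$, denoting regret against the best policy in $Π_m$, is notation assumed here, and constants as well as any dependence on the number of actions are suppressed.

```latex
% Target of the open problem (assumed notation: Reg_T(m) is the regret
% against the best policy in \Pi_m; constants and action-count factors omitted)
\mathrm{Reg}_T(m) \;\lesssim\; \ln(|\Pi_m|)^{1-\alpha}\, T^{\alpha}
\quad \text{for all } m \text{ simultaneously}, \qquad \alpha \in [\tfrac{1}{2}, 1).

% Endpoint \alpha = 1/2:
\ln(|\Pi_m|)^{1/2}\, T^{1/2} \;=\; \sqrt{\ln(|\Pi_m|)\, T}.
```

At $α=\frac{1}{2}$ the expression reduces to the familiar square-root form of the single-class guarantee. Larger values of $α$ lower the exponent on the complexity term at the price of a larger exponent on $T$.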