Factored Bandits
This work addresses a fundamental challenge in bandit learning for scenarios with structured actions, offering a novel framework that could impact reinforcement learning and decision-making systems, though it appears incremental in extending existing bandit models.
The paper tackles the problem of learning with limited feedback in bandit settings where actions decompose into atomic components, introducing the factored bandits model that relaxes reward function assumptions. It provides an anytime algorithm with matching regret bounds and shows application to dueling bandits, improving additive regret terms compared to state-of-the-art methods, with additive terms dominating up to exponential time horizons.
We introduce the factored bandits model, which is a framework for learning with limited (bandit) feedback, where actions can be decomposed into a Cartesian product of atomic actions. Factored bandits incorporate rank-1 bandits as a special case, but significantly relax the assumptions on the form of the reward function. We provide an anytime algorithm for stochastic factored bandits and up to constants matching upper and lower regret bounds for the problem. Furthermore, we show that with a slight modification the proposed algorithm can be applied to utility based dueling bandits. We obtain an improvement in the additive terms of the regret bound compared to state of the art algorithms (the additive terms are dominating up to time horizons which are exponential in the number of arms).