LG MLNov 13, 2022

Generalizing distribution of partial rewards for multi-armed bandits with temporally-partitioned rewards

Ronald C. van den Broek, Rik Litjens, Tobias Sagis, Luc Siecker, Nina Verbeeke, Pratik Gajane

arXiv:2211.06883v13.31 citationsh-index: 8

Originality Incremental advance

AI Analysis

This work addresses a specific problem in online decision-making for applications with non-uniform reward distributions, representing an incremental advancement in bandit theory.

The paper tackles the Multi-Armed Bandit problem with Temporally-Partitioned Rewards by introducing a Beta-spread property to generalize reward distribution across rounds, deriving a lower bound and proposing the TP-UCB-FR-G algorithm that improves regret upper bounds in some scenarios.

We investigate the Multi-Armed Bandit problem with Temporally-Partitioned Rewards (TP-MAB) setting in this paper. In the TP-MAB setting, an agent will receive subsets of the reward over multiple rounds rather than the entire reward for the arm all at once. In this paper, we introduce a general formulation of how an arm's cumulative reward is distributed across several rounds, called Beta-spread property. Such a generalization is needed to be able to handle partitioned rewards in which the maximum reward per round is not distributed uniformly across rounds. We derive a lower bound on the TP-MAB problem under the assumption that Beta-spread holds. Moreover, we provide an algorithm TP-UCB-FR-G, which uses the Beta-spread property to improve the regret upper bound in some scenarios. By generalizing how the cumulative reward is distributed, this setting is applicable in a broader range of applications.

View on arXiv PDF

Similar