Greedy Bandits with Sampled Context
This is an incremental improvement for contextual bandit algorithms, potentially benefiting reinforcement learning applications.
The paper tackles contextual multi-armed bandits by proposing GB-SC, which uses Thompson Sampling to develop a prior from context information and an epsilon-greedy policy to select arms. It reports competitive performance on the Mushroom environment in terms of expected regret and expected cumulative regret.
Bayesian strategies for contextual bandits have proved promising in single-state reinforcement learning tasks by modeling uncertainty using context information from the environment. In this paper, we propose Greedy Bandits with Sampled Context (GB-SC), a method for contextual multi-armed bandits that develops the prior from context information using Thompson Sampling and selects arms using an epsilon-greedy policy. The GB-SC framework allows for evaluating context-reward dependency and provides robustness to partially observable context vectors by leveraging the developed prior. Our experimental results show competitive performance on the Mushroom environment in terms of expected regret and expected cumulative regret, and they offer insights into how each context subset affects decision-making.
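To make the combination described above more concrete, here is a minimal Python sketch of one way Thompson-style posterior sampling over context features could be paired with epsilon-greedy arm selection. It is not the paper's implementation: the abstract does not specify the model, so the class name SampledContextGreedyBandit, the Beta-Bernoulli posteriors keyed by (feature index, feature value, arm), the averaging of sampled values, the binary reward, and the toy environment in run_toy are all assumptions made for illustration. Missing context entries are represented as None and simply skipped, which is one plausible reading of robustness to partially observable context.

```python
import random


class SampledContextGreedyBandit:
    """Hypothetical sketch, not the paper's GB-SC implementation.

    Keeps a Beta posterior for every (feature index, feature value, arm)
    triple, samples from those posteriors to score arms, and picks an arm
    with an epsilon-greedy rule. Rewards are assumed binary (a simplification).
    """

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.posteriors = {}  # (feature_idx, value, arm) -> [alpha, beta]

    def _params(self, key):
        # Uniform Beta(1, 1) prior for any triple not seen before.
        return self.posteriors.setdefault(key, [1.0, 1.0])

    def sampled_scores(self, context):
        """Draw one posterior sample per observed feature and average per arm."""
        scores = []
        for arm in range(self.n_arms):
            draws = [
                self.rng.betavariate(*self._params((i, value, arm)))
                for i, value in enumerate(context)
                if value is not None  # unobserved features are skipped
            ]
            # With no observed features, fall back to a uniform random score.
            scores.append(sum(draws) / len(draws) if draws else self.rng.random())
        return scores

    def select_arm(self, context):
        """Explore with probability epsilon, otherwise exploit sampled scores."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)
        scores = self.sampled_scores(context)
        return max(range(self.n_arms), key=scores.__getitem__)

    def update(self, context, arm, reward):
        """Update the Beta posteriors of the observed features for the played arm."""
        for i, value in enumerate(context):
            if value is None:
                continue
            params = self._params((i, value, arm))
            if reward > 0:
                params[0] += 1.0
            else:
                params[1] += 1.0


def run_toy(steps=500, seed=0):
    """Toy environment invented for illustration: arm 1 pays off only for 'x'."""
    env = random.Random(seed)
    agent = SampledContextGreedyBandit(n_arms=2, epsilon=0.05, seed=seed)
    cumulative_regret = 0
    for _ in range(steps):
        context = (env.choice(["x", "y"]),)
        best_arm = 1 if context[0] == "x" else 0
        arm = agent.select_arm(context)
        reward = 1 if arm == best_arm else 0
        agent.update(context, arm, reward)
        cumulative_regret += 1 - reward  # optimal reward is 1 in this toy setup
    return cumulative_regret
```

Calling run_toy() returns the cumulative regret accumulated over the toy run. In this sketch, regret at each step is the gap between the best achievable reward and the reward received, which is only a stand-in for the expected regret metrics reported on the Mushroom environment.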