LG MLDec 4, 2020

One-bit feedback is sufficient for upper confidence bound policies

Daniel Vial, Sanjay Shakkottai, R. Srikant

arXiv:2012.02876v13.32 citations

Originality Incremental advance

AI Analysis

This work is significant for researchers and practitioners in online learning and bandit algorithms, showing that highly constrained feedback mechanisms can achieve near-optimal performance, potentially reducing communication overhead in resource-limited environments.

This paper addresses the multi-armed bandit problem where arms provide only one-bit feedback. The authors demonstrate that a coding and decoding scheme can be devised such that the regret of their one-bit feedback policy asymptotically approaches the regret of a full-reward feedback policy.

We consider a variant of the traditional multi-armed bandit problem in which each arm is only able to provide one-bit feedback during each pull based on its past history of rewards. Our main result is the following: given an upper confidence bound policy which uses full-reward feedback, there exists a coding scheme for generating one-bit feedback, and a corresponding decoding scheme and arm selection policy, such that the ratio of the regret achieved by our policy and the regret of the full-reward feedback policy asymptotically approaches one.

View on arXiv PDF

Similar