One-bit feedback is sufficient for upper confidence bound policies
This work is significant for researchers and practitioners in online learning and bandit algorithms, showing that highly constrained feedback mechanisms can achieve near-optimal performance, potentially reducing communication overhead in resource-limited environments.
This paper addresses the multi-armed bandit problem where arms provide only one-bit feedback. The authors demonstrate that a coding and decoding scheme can be devised such that the regret of their one-bit feedback policy asymptotically approaches the regret of a full-reward feedback policy.
We consider a variant of the traditional multi-armed bandit problem in which each arm is only able to provide one-bit feedback during each pull based on its past history of rewards. Our main result is the following: given an upper confidence bound policy which uses full-reward feedback, there exists a coding scheme for generating one-bit feedback, and a corresponding decoding scheme and arm selection policy, such that the ratio of the regret achieved by our policy and the regret of the full-reward feedback policy asymptotically approaches one.