Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret
This addresses a fundamental limitation in online learning theory for scenarios where adversaries adapt to algorithm actions, though it is incremental as it builds on existing regret frameworks.
The paper tackles the inadequacy of standard regret in online bandit learning against adaptive adversaries by introducing policy regret, showing that sublinear policy regret is impossible against unbounded-memory adversaries but achievable with bounded memory via a conversion technique from existing algorithms.
Online learning algorithms are designed to learn even when their input is generated by an adversary. The widely-accepted formal definition of an online algorithm's ability to learn is the game-theoretic notion of regret. We argue that the standard definition of regret becomes inadequate if the adversary is allowed to adapt to the online algorithm's actions. We define the alternative notion of policy regret, which attempts to provide a more meaningful way to measure an online algorithm's performance against adaptive adversaries. Focusing on the online bandit setting, we show that no bandit algorithm can guarantee a sublinear policy regret against an adaptive adversary with unbounded memory. On the other hand, if the adversary's memory is bounded, we present a general technique that converts any bandit algorithm with a sublinear regret bound into an algorithm with a sublinear policy regret bound. We extend this result to other variants of regret, such as switching regret, internal regret, and swap regret.