Best of Both Worlds: Regret Minimization versus Minimax Play
This addresses a fundamental challenge in online learning and game theory by enabling algorithms to adaptively balance safety and exploitation, which is incremental as it builds on existing regret minimization and minimax frameworks.
The paper tackles the problem of designing online learning algorithms with bandit feedback that simultaneously achieve constant regret against a comparator strategy and sublinear regret against any fixed strategy, providing the first affirmative answer when the comparator supports every action. The results enable risk of at most O(1) loss while gaining Ω(T) from exploitable opponents in zero-sum games, combining benefits of no-regret algorithms and minimax play.
In this paper, we investigate the existence of online learning algorithms with bandit feedback that simultaneously guarantee $O(1)$ regret compared to a given comparator strategy, and $\tilde{O}(\sqrt{T})$ regret compared to any fixed strategy, where $T$ is the number of rounds. We provide the first affirmative answer to this question whenever the comparator strategy supports every action. In the context of zero-sum games with min-max value zero, both in normal- and extensive form, we show that our results allow us to guarantee to risk at most $O(1)$ loss while being able to gain $Ω(T)$ from exploitable opponents, thereby combining the benefits of both no-regret algorithms and minimax play.