LG GT OCDec 6, 2021

Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback

Wenjia Ba, Tianyi Lin, Jiawei Zhang, Zhengyuan Zhou

arXiv:2112.02856v414.126 citationsHas Code

Originality Highly original

AI Analysis

This provides a doubly optimal solution for bandit game-theoretical learning, advancing the state-of-the-art in multi-agent systems with limited feedback.

The paper tackles the problem of optimal no-regret online learning in strongly monotone games with bandit feedback, where players only observe rewards rather than gradients. It presents an algorithm that achieves the optimal regret of Θ̃(n√T) for single-agent learning and the optimal last-iterate convergence rate of Θ̃(nT^{-1/2}) to Nash equilibrium in multi-agent settings, settling an open problem.

We consider online no-regret learning in unknown games with bandit feedback, where each player can only observe its reward at each time -- determined by all players' current joint action -- rather than its gradient. We focus on the class of \textit{smooth and strongly monotone} games and study optimal no-regret learning therein. Leveraging self-concordant barrier functions, we first construct a new bandit learning algorithm and show that it achieves the single-agent optimal regret of $\tildeΘ(n\sqrt{T})$ under smooth and strongly concave reward functions ($n \geq 1$ is the problem dimension). We then show that if each player applies this no-regret learning algorithm in strongly monotone games, the joint action converges in the \textit{last iterate} to the unique Nash equilibrium at a rate of $\tildeΘ(nT^{-1/2})$. Prior to our work, the best-known convergence rate in the same class of games is $\tilde{O}(n^{2/3}T^{-1/3})$ (achieved by a different algorithm), thus leaving open the problem of optimal no-regret learning algorithms (since the known lower bound is $Ω(nT^{-1/2})$). Our results thus settle this open problem and contribute to the broad landscape of bandit game-theoretical learning by identifying the first doubly optimal bandit learning algorithm, in that it achieves (up to log factors) both optimal regret in the single-agent learning and optimal last-iterate convergence rate in the multi-agent learning. We also present preliminary numerical results on several application problems to demonstrate the efficacy of our algorithm in terms of iteration count.

View on arXiv PDF Code

Similar