GT LG OCOct 3, 2018

Bandit learning in concave $N$-person games

Mario Bravo, David S. Leslie, Panayotis Mertikopoulos

arXiv:1810.01925v114.9141 citations

Originality Incremental advance

AI Analysis

This addresses the challenge of stabilizing learning in low-information multi-agent environments, providing theoretical guarantees for convergence to equilibrium, though it is incremental as it builds on existing monotonicity conditions and bandit methods.

The paper tackles the problem of learning in non-cooperative concave games with bandit feedback, where agents have limited information, and shows that under a monotonicity condition, no-regret learning based on mirror descent converges to Nash equilibrium with probability 1, with a convergence rate nearly matching the best single-agent bandit stochastic optimization rate.

This paper examines the long-run behavior of learning with bandit feedback in non-cooperative concave games. The bandit framework accounts for extremely low-information environments where the agents may not even know they are playing a game; as such, the agents' most sensible choice in this setting would be to employ a no-regret learning algorithm. In general, this does not mean that the players' behavior stabilizes in the long run: no-regret learning may lead to cycles, even with perfect gradient information. However, if a standard monotonicity condition is satisfied, our analysis shows that no-regret learning based on mirror descent with bandit feedback converges to Nash equilibrium with probability $1$. We also derive an upper bound for the convergence rate of the process that nearly matches the best attainable rate for single-agent bandit stochastic optimization.

View on arXiv PDF

Similar