LGGTMAMLSep 18, 2019

No-Regret Learning in Unknown Games with Correlated Payoffs

arXiv:1909.08540v247 citations
AI Analysis

This work addresses the challenge of limited feedback in multi-agent learning for applications like traffic routing and recommendation systems, offering a novel approach that bridges the gap between bandit and full-information performance.

The paper tackles the problem of learning in repeated multi-agent games with unknown reward functions by proposing GP-MW, a bandit algorithm that uses Gaussian processes and multiplicative weights to exploit correlations from observing opponents' actions, achieving regret bounds comparable to full-information settings and outperforming baselines in experiments on matrix games, traffic routing, and movie recommendation.

We consider the problem of learning to play a repeated multi-agent game with an unknown reward function. Single player online learning algorithms attain strong regret bounds when provided with full information feedback, which unfortunately is unavailable in many real-world scenarios. Bandit feedback alone, i.e., observing outcomes only for the selected action, yields substantially worse performance. In this paper, we consider a natural model where, besides a noisy measurement of the obtained reward, the player can also observe the opponents' actions. This feedback model, together with a regularity assumption on the reward function, allows us to exploit the correlations among different game outcomes by means of Gaussian processes (GPs). We propose a novel confidence-bound based bandit algorithm GP-MW, which utilizes the GP model for the reward function and runs a multiplicative weight (MW) method. We obtain novel kernel-dependent regret bounds that are comparable to the known bounds in the full information setting, while substantially improving upon the existing bandit results. We experimentally demonstrate the effectiveness of GP-MW in random matrix games, as well as real-world problems of traffic routing and movie recommendation. In our experiments, GP-MW consistently outperforms several baselines, while its performance is often comparable to methods that have access to full information feedback.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes