GT | AI | MA | Oct 22, 2024

Convex Markov Games: A New Frontier for Multi-Agent Reinforcement Learning

arXiv:2410.16600v3 | 4 citations | h-index: 14 | ICML
Originality: Incremental advance
AI Analysis

This work addresses the challenge of incorporating preferences such as fairness and safety into multi-agent systems and offers a new framework for sequential decision-making. It is rated incremental because it builds on existing Markov game concepts.

The paper tackles multi-agent reinforcement learning with non-additive preferences by introducing convex Markov games, which allow general convex preferences over occupancy measures and in which pure-strategy Nash equilibria are guaranteed to exist. It also gives an algorithm that approximates equilibria by gradient descent on an upper bound of exploitability, and reports novel solutions in classic repeated normal-form games, fair coordination in a repeated asymmetric game, and safe long-term behavior in a robot warehouse. In the prisoner's dilemma, the resulting policy profile achieves higher per-player utility than observed human play while being three orders of magnitude less exploitable.
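To make "preferences over occupancy measures" concrete, here is a minimal single-agent sketch. It is not taken from the paper: the toy MDP, the names `GAMMA`, `MU0`, `P`, `occupancy`, `linear_return`, and `entropy`, and the choice of entropy as the non-additive objective are all illustrative assumptions. It computes the discounted state-action occupancy measure of a policy, then evaluates both an ordinary reward (linear in the occupancy) and an entropy term (concave in the occupancy, so its negative is a convex loss).

```python
import math

# Toy 2-state, 2-action MDP. All numbers are made up for illustration.
GAMMA = 0.9
MU0 = [1.0, 0.0]  # initial state distribution
# P[s][a][s2] = probability of moving from state s to s2 under action a
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]


def occupancy(policy, horizon=2000):
    """Truncated d(s, a) = (1 - gamma) * sum_t gamma^t * Pr(s_t = s, a_t = a)."""
    n_s, n_a = len(P), len(P[0])
    d = [[0.0] * n_a for _ in range(n_s)]
    state_dist = MU0[:]
    weight = 1.0 - GAMMA
    for _ in range(horizon):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                p_sa = state_dist[s] * policy[s][a]
                d[s][a] += weight * p_sa
                for s2 in range(n_s):
                    nxt[s2] += p_sa * P[s][a][s2]
        state_dist = nxt
        weight *= GAMMA
    return d


def linear_return(d, reward):
    """Standard additive objective: an inner product with the occupancy."""
    return sum(d[s][a] * reward[s][a] for s in range(len(d)) for a in range(len(d[0])))


def entropy(d):
    """Non-additive objective: Shannon entropy of the occupancy (concave in d)."""
    return -sum(x * math.log(x) for row in d for x in row if x > 0.0)


always_a0 = [[1.0, 0.0], [1.0, 0.0]]   # deterministic policy: action 0 everywhere
uniform = [[0.5, 0.5], [0.5, 0.5]]     # uniformly random policy
d_det = occupancy(always_a0)
d_uni = occupancy(uniform)
```

The reward term can be written as a sum of per-step rewards, while the entropy term depends on the whole occupancy measure at once, which is the sense in which such preferences do not decompose additively across time.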

Behavioral diversity, expert imitation, fairness, safety goals and others give rise to preferences in sequential decision making domains that do not decompose additively across time. We introduce the class of convex Markov games that allow general convex preferences over occupancy measures. Despite infinite time horizon and strictly higher generality than Markov games, pure strategy Nash equilibria exist. Furthermore, equilibria can be approximated empirically by performing gradient descent on an upper bound of exploitability. Our experiments reveal novel solutions to classic repeated normal-form games, find fair solutions in a repeated asymmetric coordination game, and prioritize safe long-term behavior in a robot warehouse environment. In the prisoner's dilemma, our algorithm leverages transient imitation to find a policy profile that deviates from observed human play only slightly, yet achieves higher per-player utility while also being three orders of magnitude less exploitable.
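As a point of reference for the exploitability figures quoted above, the sketch below computes exploitability for a one-shot prisoner's dilemma. This is only the simplest special case and not the paper's algorithm, which descends an upper bound of exploitability in the full convex Markov game. The payoff values are standard textbook numbers assumed here, the names `PAYOFF`, `expected_payoff`, and `exploitability` are hypothetical, and exploitability is taken to be the sum of each player's best-response gain, which is one common definition.

```python
# Action 0 = cooperate, action 1 = defect.
# PAYOFF[i][a0][a1] is player i's payoff when player 0 plays a0 and player 1 plays a1.
# Values are assumed textbook payoffs, not numbers from the paper.
PAYOFF = [
    [[3, 0], [5, 1]],  # player 0
    [[3, 5], [0, 1]],  # player 1
]
PURE = [[1.0, 0.0], [0.0, 1.0]]  # the two pure strategies as distributions


def expected_payoff(i, x0, x1):
    """Expected payoff of player i under mixed strategies x0 and x1."""
    return sum(
        x0[a0] * x1[a1] * PAYOFF[i][a0][a1]
        for a0 in range(2)
        for a1 in range(2)
    )


def exploitability(x0, x1):
    """Sum over players of the gain from a unilateral best response."""
    gain0 = max(expected_payoff(0, e, x1) for e in PURE) - expected_payoff(0, x0, x1)
    gain1 = max(expected_payoff(1, x0, e) for e in PURE) - expected_payoff(1, x0, x1)
    return gain0 + gain1


both_cooperate = exploitability([1.0, 0.0], [1.0, 0.0])
both_defect = exploitability([0.0, 1.0], [0.0, 1.0])
```

Under these assumed payoffs, mutual defection has zero exploitability because it is the one-shot Nash equilibrium, while mutual cooperation is exploitable because each player gains by defecting. A profile is an equilibrium exactly when this quantity is zero, which is why driving it down is a natural training objective.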

