LG AI MA

May 26, 2023

A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem

Paul Barde, Jakob Foerster, Derek Nowrouzezahrai, Amy Zhang

arXiv:2305.17198v2

16.019 citations

Originality: Highly original

AI Analysis

The paper addresses coordination issues in offline MARL for applications such as robotics and economics, where online data collection is costly or dangerous.

The paper tackles the offline multi-agent reinforcement learning coordination problem by identifying strategy agreement and fine-tuning challenges, and proposes a model-based method (MOMA-PPO) that generates synthetic data to solve coordination-intensive tasks, significantly outperforming model-free methods in toy and MuJoCo domains.

Training multiple agents to coordinate is an essential problem with applications in robotics, game theory, economics, and social sciences. However, most existing Multi-Agent Reinforcement Learning (MARL) methods are online and thus impractical for real-world applications in which collecting new interactions is costly or dangerous. While these algorithms should leverage offline data when available, doing so gives rise to what we call the offline coordination problem. Specifically, we identify and formalize the strategy agreement (SA) and the strategy fine-tuning (SFT) coordination challenges, two issues at which current offline MARL algorithms fail. Concretely, we reveal that the prevalent model-free methods are severely deficient and cannot handle coordination-intensive offline multi-agent tasks in either toy or MuJoCo domains. To address this setback, we emphasize the importance of inter-agent interactions and propose the very first model-based offline MARL method. Our resulting algorithm, Model-based Offline Multi-Agent Proximal Policy Optimization (MOMA-PPO), generates synthetic interaction data and enables agents to converge on a strategy while fine-tuning their policies accordingly. This simple model-based solution solves the coordination-intensive offline tasks, significantly outperforming the prevalent model-free methods even under severe partial observability and with learned world models.
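The strategy agreement challenge named in the abstract can be illustrated with a toy sketch. This is not the paper's MOMA-PPO algorithm: it assumes a hypothetical two-agent matching game, a hand-written offline dataset, a tabular reward model, and iterated best response in place of PPO. All names (`DATASET`, `fit_reward_model`, and so on) are invented for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical offline dataset of (action_1, action_2, reward) tuples.
# Two equally good conventions exist: both agents play 0, or both play 1.
DATASET = [
    (0, 0, 1.0), (1, 1, 1.0), (0, 0, 1.0), (1, 1, 1.0),
    (0, 1, 0.0), (1, 0, 0.0),
]


def fit_reward_model(dataset):
    """Tabular 'world model': mean observed reward per joint action."""
    totals, counts = defaultdict(float), defaultdict(int)
    for a1, a2, r in dataset:
        totals[(a1, a2)] += r
        counts[(a1, a2)] += 1
    return {joint: totals[joint] / counts[joint] for joint in counts}


def marginal_policies(dataset):
    """Each agent independently imitates its own actions in rewarded samples.

    Returns the probability that each agent plays action 1.
    """
    good = [(a1, a2) for a1, a2, r in dataset if r > 0]
    p1 = sum(a1 for a1, _ in good) / len(good)
    p2 = sum(a2 for _, a2 in good) / len(good)
    return p1, p2


def expected_reward(p1, p2, model):
    """Expected modeled reward when the agents sample actions independently."""
    probs1 = {0: 1 - p1, 1: p1}
    probs2 = {0: 1 - p2, 1: p2}
    return sum(
        probs1[a1] * probs2[a2] * model[(a1, a2)]
        for a1, a2 in product((0, 1), repeat=2)
    )


def best_response_in_model(model, start=(0, 1), rounds=4):
    """Agents take turns best-responding to each other inside the model.

    The simulated interaction lets them settle on a single convention.
    """
    a1, a2 = start
    for _ in range(rounds):
        a1 = max((0, 1), key=lambda a: model[(a, a2)])
        a2 = max((0, 1), key=lambda a: model[(a1, a)])
    return a1, a2


model = fit_reward_model(DATASET)
p1, p2 = marginal_policies(DATASET)
independent_return = expected_reward(p1, p2, model)
agreed_joint_action = best_response_in_model(model)
agreed_return = model[agreed_joint_action]
```

Under these assumptions, the independently imitating agents mix the two conventions and miscoordinate half the time, while agents that interact inside the learned model agree on one convention even from a mismatched starting point.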
