AI · Jul 14, 2021

Centralized Model and Exploration Policy for Multi-Agent RL

arXiv:2107.06434v2 · 21 citations
Originality: Incremental advance
AI Analysis

This work addresses a critical bottleneck in applying RL to real-world multi-agent systems such as rescue robots or quadcopters, where environment interaction is costly. The advance is incremental, since it builds on existing model-based methods.

The paper tackles the high sample complexity of multi-agent reinforcement learning in decentralized partially observable Markov decision processes (Dec-POMDPs). It proposes a model-based algorithm, MARCO, that improves sample efficiency by up to 20x in three cooperative communication tasks. Separately, the authors adapt an existing model-based method for tabular MDPs to Dec-POMDPs and prove that it achieves polynomial sample complexity.

Reinforcement learning (RL) in partially observable, fully cooperative multi-agent settings (Dec-POMDPs) can in principle be used to address many real-world challenges such as controlling a swarm of rescue robots or a team of quadcopters. However, Dec-POMDPs are significantly harder to solve than single-agent problems, with the former being NEXP-complete and the latter, MDPs, being just P-complete. Hence, current RL algorithms for Dec-POMDPs suffer from poor sample complexity, which greatly reduces their applicability to practical problems where environment interaction is costly. Our key insight is that using just a polynomial number of samples, one can learn a centralized model that generalizes across different policies. We can then optimize the policy within the learned model instead of the true system, without requiring additional environment interactions. We also learn a centralized exploration policy within our model that learns to collect additional data in state-action regions with high model uncertainty. We empirically evaluate the proposed model-based algorithm, MARCO, in three cooperative communication tasks, where it improves sample efficiency by up to 20x. Finally, to investigate the theoretical sample complexity, we adapt an existing model-based method for tabular MDPs to Dec-POMDPs, and prove that it achieves polynomial sample complexity.
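The loop described in the abstract (fit a centralized model, explore where the model is uncertain, then optimize the task policy inside the model) can be illustrated with a minimal tabular sketch. This is not the authors' MARCO implementation. It assumes a toy, fully observable, deterministic environment with two agents whose actions are merged into one joint action index, and it swaps in a simple count-based bonus `1 / sqrt(n + 1)` as the uncertainty measure. All names (`fit_model`, `value_iteration`, `run`) and the environment itself are hypothetical.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=200):
    """Greedy policy for a tabular model. P: [S, A, S], R: [S, A]."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)  # [S, A, S] @ [S] -> [S, A]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def fit_model(counts, reward_sums):
    """Empirical centralized model plus a count-based uncertainty bonus."""
    n_sa = counts.sum(axis=2)  # visits per (state, joint action)
    n_states = counts.shape[0]
    uniform = np.full(counts.shape, 1.0 / n_states)
    # Unvisited pairs fall back to a uniform next-state guess.
    P_hat = np.where(n_sa[..., None] > 0,
                     counts / np.maximum(n_sa, 1)[..., None],
                     uniform)
    R_hat = reward_sums / np.maximum(n_sa, 1)
    uncertainty = 1.0 / np.sqrt(n_sa + 1.0)  # assumed stand-in for model uncertainty
    return P_hat, R_hat, uncertainty

def run(n_rounds=20, steps_per_round=30, seed=0):
    rng = np.random.default_rng(seed)
    S, A = 6, 4  # joint action index = 2 * a1 + a2 for two binary-action agents
    next_state = rng.integers(0, S, size=(S, A))  # hypothetical deterministic dynamics
    R_true = np.zeros((S, A))
    R_true[S - 1, 0] = 1.0  # shared team reward
    counts = np.zeros((S, A, S))
    reward_sums = np.zeros((S, A))
    for _ in range(n_rounds):
        # Plan an exploration policy inside the model, rewarding uncertainty.
        P_hat, _, unc = fit_model(counts, reward_sums)
        explore_pi = value_iteration(P_hat, unc)
        s = 0
        for _ in range(steps_per_round):  # the only real environment interaction
            a = explore_pi[s]
            s2 = next_state[s, a]
            counts[s, a, s2] += 1
            reward_sums[s, a] += R_true[s, a]
            s = s2
    # Optimize the task policy purely inside the learned model.
    P_hat, R_hat, _ = fit_model(counts, reward_sums)
    return value_iteration(P_hat, R_hat), counts

if __name__ == "__main__":
    task_pi, counts = run()
    print("joint action per state:", task_pi)
    print("environment steps used:", int(counts.sum()))
```

The sketch leaves out partial observability and decentralized execution, which are central to the Dec-POMDP setting. It only shows how environment samples feed the model while both the exploration policy and the task policy are planned inside that model.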

Code Implementations: 1 repo