AI · Jul 14, 2021

Centralized Model and Exploration Policy for Multi-Agent RL

Qizhen Zhang, Chris Lu, Animesh Garg, Jakob Foerster

arXiv:2107.06434v211.121 citations · h-index: 37 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a critical bottleneck in applying RL to real-world multi-agent systems, such as rescue robots or quadcopters, where environment interaction is costly. The advance is incremental, since it builds on existing model-based methods.

The paper tackles the high sample complexity of multi-agent reinforcement learning in decentralized partially observable Markov decision processes (Dec-POMDPs). It proposes a model-based algorithm, MARCO, that improves sample efficiency by up to 20x in three cooperative communication tasks. Separately, the authors adapt an existing model-based method for tabular MDPs to Dec-POMDPs and prove that it achieves polynomial sample complexity.

Reinforcement learning (RL) in partially observable, fully cooperative multi-agent settings (Dec-POMDPs) can in principle be used to address many real-world challenges such as controlling a swarm of rescue robots or a team of quadcopters. However, Dec-POMDPs are significantly harder to solve than single-agent problems, with the former being NEXP-complete and the latter, MDPs, being just P-complete. Hence, current RL algorithms for Dec-POMDPs suffer from poor sample complexity, which greatly reduces their applicability to practical problems where environment interaction is costly. Our key insight is that using just a polynomial number of samples, one can learn a centralized model that generalizes across different policies. We can then optimize the policy within the learned model instead of the true system, without requiring additional environment interactions. We also learn a centralized exploration policy within our model that learns to collect additional data in state-action regions with high model uncertainty. We empirically evaluate the proposed model-based algorithm, MARCO, in three cooperative communication tasks, where it improves sample efficiency by up to 20x. Finally, to investigate the theoretical sample complexity, we adapt an existing model-based method for tabular MDPs to Dec-POMDPs, and prove that it achieves polynomial sample complexity.
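The loop described in the abstract (learn a centralized model, explore where the model is uncertain, then plan inside the model without further environment interaction) can be illustrated with a small tabular sketch. This is not MARCO or the authors' code. The toy environment `true_step`, the count-based uncertainty proxy, and the helper names `CountModel`, `explore_action`, and `plan_in_model` are all hypothetical, chosen only to make the idea concrete. The sketch also plans over joint actions directly, whereas the paper optimizes agents' policies.

```python
import math
from collections import defaultdict


class CountModel:
    """Tabular centralized model over (state, joint_action) pairs (illustrative only)."""

    def __init__(self):
        # counts[(state, joint_action)][next_state] -> number of observed transitions
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, state, joint_action, next_state):
        self.counts[(state, joint_action)][next_state] += 1

    def visits(self, state, joint_action):
        return sum(self.counts[(state, joint_action)].values())

    def uncertainty(self, state, joint_action):
        # Assumed count-based proxy: rarely tried pairs score higher.
        return 1.0 / math.sqrt(1 + self.visits(state, joint_action))

    def predict(self, state, joint_action):
        # Most frequently observed next state, or None if the pair is unseen.
        nxt = self.counts[(state, joint_action)]
        return max(nxt, key=nxt.get) if nxt else None


def true_step(state, joint_action):
    # Hypothetical toy dynamics: 3 states, two agents with binary actions.
    return (state + sum(joint_action)) % 3


def explore_action(model, state, joint_actions):
    # Centralized exploration: pick the joint action the model knows least about.
    return max(joint_actions, key=lambda a: model.uncertainty(state, a))


def plan_in_model(model, state, goal, joint_actions):
    # One-step planning that queries only the learned model, never the environment.
    for a in joint_actions:
        if model.predict(state, a) == goal:
            return a
    return None


joint_actions = [(a0, a1) for a0 in (0, 1) for a1 in (0, 1)]
model = CountModel()
state = 0
for _ in range(24):  # data collection: the only phase that touches the environment
    action = explore_action(model, state, joint_actions)
    next_state = true_step(state, action)
    model.update(state, action, next_state)
    state = next_state

planned = plan_in_model(model, 0, 2, joint_actions)
```

In this toy setting the uncertainty-driven exploration visits every state and joint-action pair, after which `plan_in_model` can pick an action reaching a goal state using model predictions alone. A real Dec-POMDP would add partial observability, decentralized execution, and a learned function approximator, none of which this sketch attempts.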
