MA LG Oct 31, 2021

Decentralized Multi-Agent Reinforcement Learning: An Off-Policy Method

arXiv:2111.00438v11.2

Originality: Incremental advance

AI Analysis

The paper addresses scalable, privacy-preserving coordination in multi-agent systems, a problem relevant to both researchers and practitioners. The advance appears incremental, since it builds on existing actor-critic and MARL frameworks.

The paper tackles decentralized multi-agent reinforcement learning where agents have private policies and communicate locally, proposing a decentralized actor-critic method with policy evaluation and improvement algorithms for discrete and continuous spaces. Results show improved learning speed and final performance over baselines like Q-learning and MADDPG, with off-policy execution enhancing data efficiency.

We study decentralized multi-agent reinforcement learning (MARL). In our setting, the global state, action, and reward are assumed to be fully observable, while each agent keeps its local policy private and does not share it with others. The agents are connected by a communication graph, over which each agent can exchange information with its neighbors. The agents make individual decisions and cooperate to achieve a higher accumulated reward. To this end, we first propose a decentralized actor-critic (AC) setting. We then design policy evaluation and policy improvement algorithms for Markov Decision Processes (MDPs) with discrete and continuous state-action spaces, respectively. We further give a convergence analysis for the discrete-space case, which guarantees that the policy improves when policy evaluation and policy improvement are alternated. To validate the algorithms, we design experiments and compare them with previous algorithms, e.g., Q-learning (Watkins and Dayan, 1992) and MADDPG (Lowe et al., 2017). The results show that our algorithms perform better in both learning speed and final performance. Moreover, the algorithms can be executed in an off-policy manner, which greatly improves data efficiency compared with on-policy algorithms.
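The alternation between policy evaluation and policy improvement mentioned in the abstract can be illustrated with classical tabular policy iteration. The sketch below is a single-agent, centralized illustration only, not the paper's decentralized algorithm: the two-state MDP `P`, the discount `GAMMA`, and all function names are hypothetical and chosen for the example.

```python
# Tabular policy iteration on a hypothetical 2-state, 2-action MDP.
# This MDP is an assumption for illustration; it does not come from the paper.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9  # discount factor (assumed)


def q_value(V, s, a):
    """Expected one-step return of taking action a in state s, then following V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def evaluate(policy, tol=1e-10):
    """Policy evaluation: iterate the Bellman expectation backup until it settles."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(V, s, policy[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def improve(V):
    """Policy improvement: act greedily with respect to the current values."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}


def policy_iteration(policy):
    """Alternate evaluation and improvement until the policy stops changing."""
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:
            return policy, V
        policy = new_policy


# Start from the policy that always picks action 0.
policy, V = policy_iteration({0: 0, 1: 0})
print(policy, V)
```

In this toy problem each improvement step can only keep or raise the value of every state, which is the sense in which alternating the two processes reinforces the policy. The decentralized setting described above additionally requires each agent to carry out these steps without access to the other agents' policies.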
