Off-Policy Multi-Agent Decomposed Policy Gradients
This work addresses centralized-decentralized mismatch and credit assignment in multi-agent reinforcement learning. It offers a method relevant to applications such as game AI and robotics, although the contribution is incremental: it builds on existing actor-critic and value-decomposition ideas.
The paper tackles the performance gap between multi-agent policy gradient methods and value-based approaches. It introduces a decomposed policy gradient method (DOP) that integrates value function decomposition into an actor-critic framework, and it reports that DOP significantly outperforms state-of-the-art algorithms on the StarCraft II micromanagement benchmark and multi-agent particle environments.
Multi-agent policy gradient (MAPG) methods have recently witnessed vigorous progress. However, there is a significant performance discrepancy between MAPG methods and state-of-the-art multi-agent value-based approaches. In this paper, we investigate the causes that hinder the performance of MAPG algorithms and present a multi-agent decomposed policy gradient method (DOP). This method introduces the idea of value function decomposition into the multi-agent actor-critic framework. Based on this idea, DOP supports efficient off-policy learning and addresses the issues of centralized-decentralized mismatch and credit assignment in both discrete and continuous action spaces. We formally show that DOP critics have sufficient representational capability to guarantee convergence. In addition, empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP significantly outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms. Demonstrative videos are available at https://sites.google.com/view/dop-mapg/.
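The abstract does not spell out the form of the decomposition. As a rough illustration only, the sketch below assumes a linear decomposition of the joint critic, Q_tot = sum_i k_i * Q_i + b with non-negative weights k_i, and shows how such a form might give each agent a learning signal that depends only on its own local value. The function names (`q_tot`, `local_advantage`), the weights, and all numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def q_tot(local_qs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Assumed linear decomposition: sum_i k_i * Q_i + b, with every k_i >= 0."""
    if len(local_qs) != len(weights):
        raise ValueError("need exactly one weight per agent")
    if any(k < 0 for k in weights):
        raise ValueError("weights must be non-negative")
    return sum(k * q for k, q in zip(weights, local_qs)) + bias


def local_advantage(q_table: Sequence[float], policy: Sequence[float], action: int) -> float:
    """Q_i(a_i) minus the policy-weighted mean of Q_i over the agent's own actions."""
    baseline = sum(p * q for p, q in zip(policy, q_table))
    return q_table[action] - baseline


# Two agents with two discrete actions each; values are made up for illustration.
q_tables = [[1.0, 3.0], [0.5, 0.0]]   # local critics Q_i(a_i) at one fixed state
policies = [[0.5, 0.5], [0.5, 0.5]]   # decentralized stochastic policies
weights = [2.0, 1.0]                  # state-dependent k_i, fixed here for one state
bias = 0.25                           # state-dependent b, fixed here for one state

joint_action = [1, 0]
chosen_qs = [q_tables[i][a] for i, a in enumerate(joint_action)]
total = q_tot(chosen_qs, weights, bias)

# Per-agent credit: each term uses only that agent's own table, policy, and weight,
# so another agent's action choice does not change it.
credits = [
    weights[i] * local_advantage(q_tables[i], policies[i], a)
    for i, a in enumerate(joint_action)
]
```

Under this assumed form, an agent's credit is unaffected by what its teammates do at the same state, which is one way to picture how a decomposed critic could sidestep centralized-decentralized mismatch. Whether DOP uses exactly this weighting is not stated in the abstract.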