Decentralized Policy Optimization
This paper addresses the problem of reliable decentralized learning in multi-agent systems, offering a theoretically grounded solution with empirical validation, though its contribution is an incremental improvement over existing decentralized methods.
The paper tackles the lack of convergence guarantees in decentralized actor-critic methods for cooperative multi-agent reinforcement learning by proposing Decentralized Policy Optimization (DPO), which ensures monotonic improvement and outperforms independent PPO in most tasks across various environments.
The study of decentralized (or independent) learning in cooperative multi-agent reinforcement learning spans decades. Recent empirical studies show that independent PPO (IPPO) can perform close to, or even better than, methods based on centralized training with decentralized execution on several benchmarks. However, a decentralized actor-critic method with a convergence guarantee remains an open problem. In this paper, we propose \textit{decentralized policy optimization} (DPO), a decentralized actor-critic algorithm with monotonic improvement and a convergence guarantee. We derive a novel decentralized surrogate for policy optimization such that monotonic improvement of the joint policy is guaranteed when each agent \textit{independently} optimizes the surrogate. In practice, this decentralized surrogate can be realized with two adaptive coefficients for policy optimization at each agent. Empirically, we compare DPO with IPPO on a variety of cooperative multi-agent tasks, covering discrete and continuous action spaces as well as fully and partially observable environments. The results show that DPO outperforms IPPO in most tasks, which provides empirical support for our theoretical results.
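The abstract does not state the form of the decentralized surrogate or of its two adaptive coefficients. As a rough illustration only, the sketch below assumes a per-agent objective made of an importance-weighted advantage term minus two penalties, one on the KL divergence and one on its square root, with each coefficient adjusted by a PPO-style adaptive rule. The penalty form, the function names (`penalized_surrogate`, `adapt_coefficient`), and the parameters (`beta_sqrt`, `beta_kl`, `tol`, `factor`) are all assumptions made for illustration, not the paper's definitions.

```python
import numpy as np


def categorical_kl(p_old, p_new, eps=1e-8):
    """Mean KL(p_old || p_new) over a batch of categorical distributions."""
    p_old = np.clip(p_old, eps, 1.0)
    p_new = np.clip(p_new, eps, 1.0)
    kl = np.mean(np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1))
    # Guard against tiny negative values caused by floating-point error.
    return max(float(kl), 0.0)


def penalized_surrogate(p_old, p_new, actions, advantages, beta_sqrt, beta_kl):
    """Hypothetical per-agent objective (assumed form, not taken from the paper).

    Importance-weighted advantage minus two KL-based penalties, each scaled
    by its own coefficient. Uses only the agent's own policy and data.
    """
    idx = np.arange(len(actions))
    ratio = p_new[idx, actions] / p_old[idx, actions]
    kl = categorical_kl(p_old, p_new)
    return float(np.mean(ratio * advantages)) - beta_sqrt * np.sqrt(kl) - beta_kl * kl


def adapt_coefficient(beta, measured, target, tol=1.5, factor=2.0):
    """PPO-style adaptive rule (assumed): tighten the penalty when the measured
    divergence overshoots the target, loosen it when it undershoots."""
    if measured > target * tol:
        return beta * factor
    if measured < target / tol:
        return beta / factor
    return beta


if __name__ == "__main__":
    p_old = np.array([[0.5, 0.5], [0.2, 0.8]])
    p_new = np.array([[0.6, 0.4], [0.3, 0.7]])
    actions = np.array([0, 1])
    advantages = np.array([1.0, -0.5])

    kl = categorical_kl(p_old, p_new)
    print("surrogate:", penalized_surrogate(p_old, p_new, actions, advantages, 1.0, 1.0))
    # Each coefficient is adapted independently against its own target.
    print("beta_sqrt:", adapt_coefficient(1.0, np.sqrt(kl), target=0.1))
    print("beta_kl:", adapt_coefficient(1.0, kl, target=0.01))
```

Under these assumptions, each agent evaluates and adapts its objective using only its own policy and trajectories, which is what makes the update decentralized. Whether this matches the surrogate and coefficient schedule actually used by DPO should be checked against the full paper.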