MA AI LG · Jul 12, 2022

Towards Global Optimality in Cooperative MARL with the Transformation And Distillation Framework

Jianing Ye, Chenghao Li, Yongqiang Dou, Jianhao Wang, Guangwen Yang, Chongjie Zhang

arXiv:2207.11143v45.16 citations · h-index: 33

Originality: Highly original

AI Analysis

The paper addresses optimization issues in cooperative MARL, offering a method that guarantees global optimality while still supporting decentralized execution.

The paper tackles the suboptimality of decentralized-policy multi-agent reinforcement learning (MARL) algorithms trained with gradient descent, proving that multi-agent policy gradient and value-decomposition methods are suboptimal in this setting. It proposes the Transformation And Distillation (TAD) framework, which comes with an optimality guarantee, and reports significant gains on tasks including StarCraft II and football games.

Decentralized execution is a core demand in multi-agent reinforcement learning (MARL). Recently, most popular MARL algorithms have adopted decentralized policies to enable decentralized execution, and use gradient descent as the optimizer. However, there is hardly any theoretical analysis of these algorithms that takes the optimization method into consideration, and we find that various popular MARL algorithms with decentralized policies are suboptimal in toy tasks when gradient descent is chosen as their optimization method. In this paper, we theoretically analyze two common classes of algorithms with decentralized policies -- multi-agent policy gradient methods and value-decomposition methods -- and prove their suboptimality when gradient descent is used. To address the suboptimality issue, we propose the Transformation And Distillation (TAD) framework, which reformulates a multi-agent MDP as a special single-agent MDP with a sequential structure and enables decentralized execution by distilling the learned policy on the derived "single-agent" MDP. The approach is a two-stage learning paradigm that addresses the optimization problem in cooperative MARL, providing an optimality guarantee with decent execution performance. Empirically, we implement TAD-PPO based on PPO, which can theoretically perform optimal policy learning in finite multi-agent MDPs and shows significant outperformance on a large set of cooperative multi-agent tasks, ranging from a matrix game and a hallway task to StarCraft II and a football game.
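
The two-stage idea in the abstract can be illustrated on a toy problem. The sketch below is not the authors' TAD-PPO: it assumes a hypothetical 2-agent, 3-action one-shot matrix game (the payoff values are illustrative), replaces PPO with exhaustive tabular search, and treats "distillation" as simply reading off each agent's action. The names `PAYOFF`, `solve_sequential`, and `distill` are invented for this example. Stage one lets the agents pick actions one after another, so the game becomes a single-agent decision problem whose state is the prefix of actions chosen so far; stage two turns the resulting joint action into one independent policy per agent.

```python
from itertools import product

# Hypothetical cooperative payoff matrix (illustrative values, not from the paper).
# Rows are agent 0's action, columns are agent 1's action.
PAYOFF = [
    [8, -12, -12],
    [-12, 0, 0],
    [-12, 0, 0],
]
N_AGENTS = 2
N_ACTIONS = 3


def joint_reward(joint_action):
    """Shared team reward for a complete joint action (a0, a1)."""
    a0, a1 = joint_action
    return PAYOFF[a0][a1]


def solve_sequential(prefix=()):
    """Stage 1 (transformation): solve the sequential single-agent problem.

    The state is the tuple of actions already chosen. Each step picks the next
    agent's action, and the reward arrives once every agent has acted.
    Returns (best value, best complete joint action) reachable from `prefix`.
    """
    if len(prefix) == N_AGENTS:
        return joint_reward(prefix), prefix
    best = None
    for action in range(N_ACTIONS):
        value, full = solve_sequential(prefix + (action,))
        if best is None or value > best[0]:
            best = (value, full)
    return best


def distill(joint_action):
    """Stage 2 (distillation, trivial here): one independent policy per agent.

    In this stateless game each decentralized policy is a constant action.
    """
    return [lambda a=a: a for a in joint_action]


value, joint = solve_sequential()
policies = distill(joint)
# Decentralized execution: each agent acts without seeing the others' choices.
executed = tuple(policy() for policy in policies)
print(value, joint, executed)
```

Because the sequential problem is solved over complete joint actions, the search is not trapped by the large penalties surrounding the best cell, which is the kind of payoff structure that makes independent learners settle on a safer but worse outcome. In a game with real observations, the distillation stage would need to fit observation-conditioned policies rather than constants.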
