SYAIGTMAOct 14, 2022

Decentralized Policy Gradient for Nash Equilibria Learning of General-sum Stochastic Games

arXiv:2210.07651v24 citationsh-index: 15
Originality Incremental advance
AI Analysis

This addresses the challenge of decentralized multi-agent reinforcement learning in competitive environments, offering a novel theoretical framework with convergence guarantees, though it appears incremental in extending existing variational inequality methods to stochastic games.

The paper tackles the problem of learning Nash equilibria in general-sum stochastic games with unknown transition dynamics, where agents only observe their own rewards and the environment state. It proposes a two-loop algorithm for exact pseudo gradients achieving convergence to a k^{1/2}-weighted asymptotic Nash equilibrium, and a decentralized algorithm with Monte-Carlo gradient estimation achieving convergence to a k^{1/4}-weighted asymptotic Nash equilibrium in probability.

We study Nash equilibria learning of a general-sum stochastic game with an unknown transition probability density function. Agents take actions at the current environment state and their joint action influences the transition of the environment state and their immediate rewards. Each agent only observes the environment state and its own immediate reward and is unknown about the actions or immediate rewards of others. We introduce the concepts of weighted asymptotic Nash equilibrium with probability 1 and in probability. For the case with exact pseudo gradients, we design a two-loop algorithm by the equivalence of Nash equilibrium and variational inequality problems. In the outer loop, we sequentially update a constructed strongly monotone variational inequality by updating a proximal parameter while employing a single-call extra-gradient algorithm in the inner loop for solving the constructed variational inequality. We show that if the associated Minty variational inequality has a solution, then the designed algorithm converges to the k^{1/2}-weighted asymptotic Nash equilibrium. Further, for the case with unknown pseudo gradients, we propose a decentralized algorithm, where the G(PO)MDP gradient estimator of the pseudo gradient is provided by Monte-Carlo simulations. The convergence to the k^{1/4} -weighted asymptotic Nash equilibrium in probability is achieved.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes