MA AI GT LG May 27, 2023

Reinforcement Learning With Reward Machines in Stochastic Games

Jueming Hu, Jean-Raphael Gaglione, Yanze Wang, Zhe Xu, Ufuk Topcu, Yongming Liu

arXiv:2305.17372v3

Originality: Incremental advance

AI Analysis

The paper addresses complex task learning in multi-agent systems and offers a method for handling non-Markovian rewards. The advance is incremental, since the method builds on existing reward machine and Q-learning frameworks.

The paper tackles multi-agent reinforcement learning in stochastic games with non-Markovian rewards. It develops QRM-SG, an algorithm that integrates reward machines to learn each agent's best-response strategy at a Nash equilibrium. QRM-SG learns these strategies after around 7500, 1000, and 1500 episodes in the three case studies, whereas the Nash Q-learning and MADDPG baselines fail to converge to the Nash equilibrium.

We investigate multi-agent reinforcement learning for stochastic games with complex tasks, where the reward functions are non-Markovian. We utilize reward machines to incorporate high-level knowledge of complex tasks. We develop an algorithm called Q-learning with reward machines for stochastic games (QRM-SG), to learn the best-response strategy at Nash equilibrium for each agent. In QRM-SG, we define the Q-function at a Nash equilibrium in augmented state space. The augmented state space integrates the state of the stochastic game and the state of reward machines. Each agent learns the Q-functions of all agents in the system. We prove that Q-functions learned in QRM-SG converge to the Q-functions at a Nash equilibrium if the stage game at each time step during learning has a global optimum point or a saddle point, and the agents update Q-functions based on the best-response strategy at this point. We use the Lemke-Howson method to derive the best-response strategy given current Q-functions. The three case studies show that QRM-SG can learn the best-response strategies effectively. QRM-SG learns the best-response strategies after around 7500 episodes in Case Study I, 1000 episodes in Case Study II, and 1500 episodes in Case Study III, while baseline methods such as Nash Q-learning and MADDPG fail to converge to the Nash equilibrium in all three case studies.
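To make the idea of a reward machine and an augmented state space concrete, here is a minimal Python sketch. It is not the authors' implementation: the task ("visit a, then b"), the state names, the `RewardMachine` class, and the `q_update` helper are all hypothetical. The equilibrium payoff of the next stage game is passed in by the caller rather than computed, so the Lemke-Howson step mentioned in the abstract is deliberately left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RewardMachine:
    """Finite-state machine: (rm_state, label) -> (next_rm_state, reward)."""
    initial: str
    transitions: dict

    def step(self, u, label):
        # Assumption: unlisted (state, label) pairs self-loop with zero reward.
        return self.transitions.get((u, label), (u, 0.0))


# Hypothetical task: observe label "a", then label "b".
# Reward is paid only when the full sequence is completed.
rm = RewardMachine(
    initial="u0",
    transitions={
        ("u0", "a"): ("u1", 0.0),
        ("u1", "b"): ("u_done", 1.0),
    },
)


def run(labels):
    """Feed a label sequence through the reward machine."""
    u, total = rm.initial, 0.0
    for label in labels:
        u, r = rm.step(u, label)
        total += r
    return u, total


def q_update(Q, key, reward, next_value, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step on an augmented-state key.

    key is ((game_state, rm_state), joint_action). next_value stands in
    for the agent's equilibrium payoff in the next stage game, which the
    caller must supply.
    """
    Q[key] = (1 - alpha) * Q[key] + alpha * (reward + gamma * next_value)
    return Q[key]


Q = defaultdict(float)
key = (("s0", "u1"), ("up", "left"))
new_q = q_update(Q, key, reward=1.0, next_value=0.0)
```

The same label "b" earns nothing in `run(["b"])` but completes the task in `run(["a", "b"])`, which is what makes the reward non-Markovian in the game state alone. Pairing the game state with the reward machine state in the Q-table key restores the Markov property for learning.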
