Multi-agent Reach-avoid MDP via Potential Games and Low-rank Policy Structure

Adam Casselman, Abraham P. Vinod, Sarah H. Q. Li

arXiv:2410.176900.81 citationsh-index: 3

Predicted impact top 99% in SY · last 90 daysOriginality Incremental advance

AI Analysis

This addresses scalability issues in multi-agent reinforcement learning for safety-critical applications like robotics or autonomous systems, though it appears incremental as it builds on existing MDP and game theory frameworks.

The paper tackles the exponential scaling of communication, memory, and computation in multi-agent reach-avoid MDPs by restricting solutions to local feedback policies, showing they are rank-one factorizations of global policies and that the problem has a potential game structure. Numerical simulations show significant reductions in peak memory usage and offline computation complexity while maintaining approximation error to the optimal objective.

We optimize finite horizon multi-agent reach-avoid Markov decision process (MDP) via \emph{local feedback policies}. The global feedback policy solution yields global optimality but its communication complexity, memory usage and computation complexity scale exponentially with the number of agents. We mitigate this exponential dependency by restricting the solution space to local feedback policies and show that local feedback policies are rank-one factorizations of global feedback policies, which provides a principled approach to reducing communication complexity and memory usage. Additionally, by demonstrating that multi-agent reach-avoid MDPs over local feedback policies has a potential game structure, we show that iterative best response is a tractable multi-agent learning scheme with guaranteed convergence to deterministic Nash equilibrium, and derive each agent's best response via multiplicative dynamic program (DP) over the joint state space. Numerical simulations across different MDPs and agent sets show that the peak memory usage and offline computation complexity are significantly reduced while the approximation error to the optimal global reach-avoid objective is maintained.

View on arXiv PDF

Similar