Learning Partial Action Replacement in Offline MARL

arXiv:2603.28573

AI Analysis

This work targets two problems in offline MARL: sparse dataset coverage and the computational cost of existing partial action replacement methods. It is an incremental improvement over prior work on partial action replacement.

The paper tackles the exponential growth of the joint action space in offline multi-agent reinforcement learning. It introduces PLCQL, a framework that formulates partial action replacement subset selection as a contextual bandit problem and learns a state-dependent policy that decides how many agents to replace at each update step. Empirically, PLCQL achieves the highest normalized scores on 66% of tasks across the MPE, MaMuJoCo, and SMAC benchmarks and outperforms the prior PAR-based method SPaCQL on 84% of tasks, while reducing per-iteration Q-function evaluations from n to 1.

Offline multi-agent reinforcement learning (MARL) faces a critical challenge: the joint action space grows exponentially with the number of agents, making dataset coverage exponentially sparse and out-of-distribution (OOD) joint actions unavoidable. Partial Action Replacement (PAR) mitigates this by anchoring a subset of agents to dataset actions, but the existing approach relies on enumerating multiple subset configurations at high computational cost and cannot adapt to varying states. We introduce PLCQL, a framework that formulates PAR subset selection as a contextual bandit problem and learns a state-dependent PAR policy using Proximal Policy Optimisation with an uncertainty-weighted reward. This adaptive policy dynamically determines how many agents to replace at each update step, balancing policy improvement against conservative value estimation. We prove a value-error bound showing that the estimation error scales linearly with the expected number of deviating agents. Compared with the previous PAR-based method SPaCQL, PLCQL reduces the number of per-iteration Q-function evaluations from n to 1, significantly improving computational efficiency. Empirically, PLCQL achieves the highest normalised scores on 66% of tasks across MPE, MaMuJoCo, and SMAC benchmarks, outperforming SPaCQL on 84% of tasks while substantially reducing computational cost.
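To make the idea of partial action replacement concrete, the following minimal Python sketch builds a mixed joint action in which k agents take their policy actions and the remaining agents stay anchored to the dataset actions. It is an illustration under stated assumptions, not the authors' implementation: the function names, the uniform random choice of which agents to replace, and the reading of "enumeration" as one configuration per k = 1..n are all hypothetical. The learned PPO policy with its uncertainty-weighted reward is stood in for here by a fixed probability vector over k.

```python
import random


def partial_action_replacement(dataset_actions, policy_actions, k, rng):
    """Mix two joint actions: k agents use policy actions, the rest use dataset actions.

    Assumption: the k replaced agents are drawn uniformly at random.
    """
    n = len(dataset_actions)
    if len(policy_actions) != n:
        raise ValueError("both joint actions must cover the same agents")
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and the number of agents")
    replaced = set(rng.sample(range(n), k))
    mixed = [
        policy_actions[i] if i in replaced else dataset_actions[i]
        for i in range(n)
    ]
    return mixed, replaced


def enumerate_targets(dataset_actions, policy_actions, rng):
    """Enumeration-style baseline (assumed): one mixed action per k = 1..n.

    Each returned joint action would need its own Q-function evaluation.
    """
    n = len(dataset_actions)
    return [
        partial_action_replacement(dataset_actions, policy_actions, k, rng)[0]
        for k in range(1, n + 1)
    ]


def sample_target(dataset_actions, policy_actions, k_probs, rng):
    """Adaptive-style variant (assumed): draw a single k, so one Q evaluation.

    k_probs is a hypothetical stand-in for a state-dependent policy over k.
    """
    n = len(dataset_actions)
    k = rng.choices(range(1, n + 1), weights=k_probs, k=1)[0]
    mixed, _ = partial_action_replacement(dataset_actions, policy_actions, k, rng)
    return mixed, k


# Example with three agents and string placeholders for actions.
rng = random.Random(0)
data = ["d0", "d1", "d2"]
pol = ["p0", "p1", "p2"]
all_targets = enumerate_targets(data, pol, rng)          # n mixed joint actions
one_target, chosen_k = sample_target(data, pol, [0.6, 0.3, 0.1], rng)  # one
```

In this sketch the enumeration helper produces n joint actions per update while the sampling helper produces one, which mirrors the reduction from n to 1 Q-function evaluations described in the abstract.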
