MA · AI · LG · Feb 26

QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning

arXiv:2602.22786v1 · h-index: 11 · Has Code
Originality: Incremental advance
AI Analysis

This work tackles Q-value overestimation, a known issue in cooperative multi-agent reinforcement learning (MARL), with the aim of improving learning stability and policy optimality. It is relevant to MARL researchers and practitioners.

This paper addresses Q-value overestimation in multi-agent reinforcement learning (MARL) methods that use value decomposition. The authors propose QSIM, a framework that reconstructs the temporal-difference (TD) target as a similarity-weighted expectation over a structured near-greedy joint action space. They report improved performance and stability when QSIM is combined with various value decomposition methods.
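As a rough illustration of why a max-based target overestimates, the sketch below takes the maximum over noisy estimates of Q-values that are all truly zero. It is not taken from the paper: the Gaussian noise model, the helper name `max_bias`, and the choice of 5 actions per agent are assumptions made for the example. The average maximum is positive and grows with the number of joint actions.

```python
import random
import statistics

def max_bias(num_joint_actions, noise_std=1.0, trials=2000, seed=0):
    """Average of max over noisy estimates when every true Q-value is 0.

    Hypothetical helper: any positive result is pure overestimation,
    because the true maximum is exactly 0.
    """
    rng = random.Random(seed)
    maxima = [
        max(rng.gauss(0.0, noise_std) for _ in range(num_joint_actions))
        for _ in range(trials)
    ]
    return statistics.mean(maxima)

# Assumed setting: 5 actions per agent, so n agents give 5**n joint actions.
biases = {n: max_bias(5 ** n) for n in (1, 2, 3)}

for n, bias in biases.items():
    print(f"{n} agent(s), {5 ** n} joint actions: mean max = {bias:.2f}")
```

The upward trend with the agent count mirrors the abstract's point that the joint action space grows combinatorially.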

Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl-qsim.
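The abstract does not spell out how the near-greedy set or the similarity weights are built, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that the near-greedy set holds joint actions differing from the greedy one in at most one agent, that similarity is the fraction of agents agreeing with the greedy action, and that weights are a softmax over similarity. The names `near_greedy_set`, `similarity`, `similarity_weighted_target`, and the `temperature` parameter are all hypothetical.

```python
import math
from itertools import product

def near_greedy_set(greedy, num_actions, max_changes=1):
    """Joint actions differing from `greedy` in at most `max_changes` agents (assumed definition)."""
    return [
        joint
        for joint in product(range(num_actions), repeat=len(greedy))
        if sum(a != g for a, g in zip(joint, greedy)) <= max_changes
    ]

def similarity(joint, greedy):
    """Fraction of agents that pick their greedy action (assumed metric)."""
    return sum(a == g for a, g in zip(joint, greedy)) / len(greedy)

def similarity_weighted_target(reward, gamma, q_next, greedy, num_actions, temperature=0.5):
    """TD target using a similarity-weighted expectation instead of the max."""
    candidates = near_greedy_set(greedy, num_actions)
    logits = [similarity(c, greedy) / temperature for c in candidates]
    peak = max(logits)  # subtract the peak for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    expected_q = sum(w / total * q_next(c) for w, c in zip(exps, candidates))
    return reward + gamma * expected_q

# Toy additive (VDN-style) joint Q: 2 agents, 3 actions each. Values are made up.
utilities = [[1.0, 0.2, 0.0], [0.5, 0.9, 0.1]]

def q_next(joint):
    return sum(u[a] for u, a in zip(utilities, joint))

greedy = tuple(max(range(3), key=u.__getitem__) for u in utilities)
hard_target = 0.0 + 0.99 * q_next(greedy)  # standard max-based target
soft_target = similarity_weighted_target(0.0, 0.99, q_next, greedy, num_actions=3)
print(greedy, hard_target, soft_target)
```

In this toy case the weighted target sits below the max-based one, since every non-greedy candidate has a lower Q-value but still receives some weight. Lowering `temperature` concentrates weight on the greedy action and moves the two targets closer together.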

