QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning

Yuanjun Li, Bin Zhang, Hao Chen, Zhouyang Jiang, Dapeng Li, Zhiwei Xu

arXiv:2602.22786v11.2 | h-index: 14 | Has Code

Originality: Incremental advance

AI Analysis

This work tackles Q-value overestimation, a known issue in cooperative multi-agent reinforcement learning, with the aim of improving learning stability and policy optimality.
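As a toy illustration of why overestimation arises, the sketch below takes the maximum over noisy value estimates whose true values are all zero. The independent Gaussian noise model, the function name `mean_max_of_noisy_estimates`, and the choice of 5 actions per agent are assumptions made for illustration; this is not the paper's analysis, and value decomposition methods do not enumerate the joint action space this way.

```python
import random
import statistics


def mean_max_of_noisy_estimates(num_actions, num_trials=2000, noise_std=1.0, seed=0):
    """Average of the max over noisy estimates when every true Q-value is 0.

    Any positive result is pure overestimation introduced by the max operator.
    """
    rng = random.Random(seed)
    maxima = []
    for _ in range(num_trials):
        # Each estimate is the true value (0.0) plus zero-mean Gaussian noise.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        maxima.append(max(estimates))
    return statistics.fmean(maxima)


# Assumed setup: 5 actions per agent, so n agents give 5**n joint actions.
bias_by_agents = {n: mean_max_of_noisy_estimates(5 ** n) for n in (1, 2, 3)}

for n, bias in bias_by_agents.items():
    print(f"{n} agent(s), {5 ** n} joint actions: mean max = {bias:.3f}")
```

Under these assumptions the average maximum is positive even though every true value is zero, and it grows as the number of joint actions increases.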

This paper addresses Q-value overestimation in multi-agent reinforcement learning (MARL) methods that use value decomposition. The authors propose QSIM, a framework that reconstructs the temporal-difference (TD) target as a similarity-weighted expectation over a structured near-greedy joint action space. They report improved performance and stability when QSIM is combined with various value decomposition methods.

Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity-weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity-weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl-qsim.
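The idea of replacing the greedy target with a similarity-weighted expectation can be sketched as follows. The abstract does not say how the near-greedy space or the similarity measure is defined, so everything specific here is an assumption for illustration: the near-greedy set is taken to be the greedy joint action plus all joint actions that change one agent's action, similarity is the fraction of agents matching the greedy choice, and weights come from a softmax with a hypothetical `temperature` parameter. It is not the implementation in the authors' repository.

```python
import math
from itertools import product


def greedy_joint_action(q_joint, num_agents, num_actions):
    """Brute-force argmax over the joint action space (toy sizes only)."""
    return max(product(range(num_actions), repeat=num_agents), key=q_joint)


def near_greedy_actions(greedy, num_actions):
    """Assumed near-greedy set: greedy action plus single-agent deviations."""
    candidates = [tuple(greedy)]
    for i, a_i in enumerate(greedy):
        for alt in range(num_actions):
            if alt != a_i:
                candidates.append(greedy[:i] + (alt,) + greedy[i + 1:])
    return candidates


def similarity(action, greedy):
    """Assumed similarity: fraction of agents whose action matches the greedy one."""
    return sum(a == g for a, g in zip(action, greedy)) / len(greedy)


def similarity_weighted_target(q_joint, reward, gamma, num_agents, num_actions,
                               temperature=0.5):
    """TD target using a softmax-over-similarity expectation instead of the max."""
    greedy = greedy_joint_action(q_joint, num_agents, num_actions)
    candidates = near_greedy_actions(greedy, num_actions)
    logits = [similarity(a, greedy) / temperature for a in candidates]
    top = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    expected_q = sum(w / total * q_joint(a) for w, a in zip(weights, candidates))
    return reward + gamma * expected_q


# Toy next-state joint Q-values for 2 agents with 2 actions each (made-up numbers).
q_table = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 2.0}
q_joint = q_table.__getitem__

max_target = 0.0 + 1.0 * max(q_table.values())
smoothed_target = similarity_weighted_target(q_joint, 0.0, 1.0, 2, 2)
sharp_target = similarity_weighted_target(q_joint, 0.0, 1.0, 2, 2, temperature=0.001)
```

Because the target is a weighted average of Q-values that includes the greedy one, it can never exceed the max-based target, and in this sketch it approaches the max-based target as `temperature` shrinks.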
