MA LG | Feb 4, 2025

Dual Ensembled Multiagent Q-Learning with Hypernet Regularizer

Yaodong Yang, Guangyong Chen, Hongyao Tang, Furui Liu, Danruo Deng, Pheng Ann Heng

arXiv:2502.02018v13.32 citations | h-index: 24 | Has Code | AAMAS

Originality: Incremental advance

AI Analysis

This work addresses a critical stability issue in multiagent systems, offering a novel solution to reduce overestimation that scales with agent numbers, though it is incremental as it builds on existing single-agent methods.

The paper tackles overestimation in multiagent reinforcement learning with a dual approach: it extends random ensemble techniques to target Q-value estimation and introduces a hypernet regularizer that constrains online Q-network optimization. The authors report that the method addresses overestimation across MPE and SMAC tasks.

Overestimation in single-agent reinforcement learning has been extensively studied. In contrast, overestimation in the multiagent setting has received comparatively little attention, although it increases with the number of agents and leads to severe learning instability. Previous works concentrate on reducing overestimation when estimating the target Q-value. They ignore the subsequent optimization of the online Q-network, which makes it hard to fully address the complex multiagent overestimation problem. To address this challenge, we first establish an iterative estimation-optimization analysis framework for multiagent value-mixing Q-learning. Our analysis reveals that multiagent overestimation not only comes from the computation of the target Q-value but also accumulates during the optimization of the online Q-network. Motivated by this finding, we propose the Dual Ensembled Multiagent Q-Learning with Hypernet Regularizer algorithm to tackle multiagent overestimation from two aspects. First, we extend the random ensemble technique to the estimation of target individual and global Q-values to derive a lower update target. Second, we propose a novel hypernet regularizer on hypernetwork weights and biases that constrains the optimization of the online global Q-network to prevent overestimation from accumulating. Extensive experiments in MPE and SMAC show that the proposed method successfully addresses overestimation across various tasks.
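
To make the two ingredients concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the "random ensemble technique" means taking the minimum over a randomly chosen subset of ensemble estimates, and that the hypernet regularizer is an L1-style penalty on the hypernetwork's output weights and biases. The abstract does not state either form. All function names, parameters, and numbers are hypothetical.

```python
import random


def random_ensemble_min(q_estimates, subset_size, rng):
    """Pessimistic estimate: minimum over a random subset of an ensemble.

    q_estimates holds one Q estimate per ensemble member. The same helper
    could be applied to individual or to global Q-values (an assumption).
    """
    chosen = rng.sample(range(len(q_estimates)), subset_size)
    return min(q_estimates[k] for k in chosen)


def td_target(reward, gamma, q_next_estimates, subset_size, rng, done=False):
    """One-step TD target built from the pessimistic ensemble estimate."""
    if done:
        return reward
    q_next = random_ensemble_min(q_next_estimates, subset_size, rng)
    return reward + gamma * q_next


def l1_penalty(hyper_weights, hyper_biases, coeff):
    """Assumed regularizer: coeff times the sum of absolute values of the
    mixing weights and biases produced by the hypernetwork."""
    total = sum(abs(w) for w in hyper_weights) + sum(abs(b) for b in hyper_biases)
    return coeff * total


def regularized_loss(q_tot, target, hyper_weights, hyper_biases, coeff):
    """Squared TD error of the online global Q-value plus the penalty."""
    td_error = q_tot - target
    return td_error ** 2 + l1_penalty(hyper_weights, hyper_biases, coeff)


# Illustrative numbers only; none come from the paper.
rng = random.Random(0)
q_next = [1.0, 1.4, 0.9, 1.2]  # four hypothetical ensemble estimates
target = td_target(reward=0.5, gamma=0.99, q_next_estimates=q_next,
                   subset_size=2, rng=rng)
loss = regularized_loss(q_tot=1.8, target=target,
                        hyper_weights=[0.7, 1.1], hyper_biases=[0.2],
                        coeff=0.01)
```

Because the minimum over a subset never exceeds the maximum over the whole ensemble, the resulting target is at most the one obtained from the most optimistic member, which is the sense in which the update target is "lower". The penalty term discourages large mixing weights and biases in the online global Q-network, which is one plausible way to limit the accumulation the abstract describes.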
