LG | AI | Feb 3, 2024

SQT -- std $Q$-target

arXiv:2402.05950v3 | h-index: 16
AI Analysis

This paper addresses overestimation bias in RL and is aimed at practitioners; it offers an incremental improvement over existing methods.

The paper tackles overestimation bias in reinforcement learning by introducing SQT, a conservative actor-critic algorithm that uses the standard deviation of its Q-networks as an uncertainty penalty. It reports a clear performance advantage over DDPG, TD3, and TD7 across seven MuJoCo and Bullet tasks.

Std $Q$-target (SQT) is a conservative, ensemble-based actor-critic algorithm built on $Q$-learning. It rests on a single key $Q$-formula: the standard deviation of the $Q$-networks, which acts as an "uncertainty penalty" and serves as a minimalistic solution to the problem of overestimation bias. We implement SQT on top of the TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms DDPG, TD3, and TD7 on seven popular MuJoCo and Bullet tasks. Our results demonstrate the superiority of SQT's $Q$-target formula over TD3's $Q$-target formula as a conservative solution to overestimation bias in RL, and SQT shows a clear performance advantage, by a wide margin, over DDPG, TD3, and TD7 on all tasks.
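To make the idea of a standard-deviation "uncertainty penalty" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not give the exact formula, so this sketch makes three assumptions: the penalty is subtracted from TD3's minimum-over-critics bootstrap value, the penalty is scaled by a hypothetical coefficient `alpha`, and the population standard deviation is used. The function names are illustrative only.

```python
import statistics


def td3_target(reward, done, next_q_values, gamma=0.99):
    """TD3-style clipped target: bootstrap from the smallest critic estimate."""
    return reward + gamma * (1.0 - done) * min(next_q_values)


def std_penalized_target(reward, done, next_q_values, gamma=0.99, alpha=0.5):
    """Hypothetical SQT-style target (assumed form, not taken from the paper).

    The disagreement between critics, measured as the population standard
    deviation of their estimates, is subtracted from the TD3 bootstrap value.
    `alpha` is an assumed penalty coefficient.
    """
    penalty = alpha * statistics.pstdev(next_q_values)
    return reward + gamma * (1.0 - done) * (min(next_q_values) - penalty)


# Two critics that disagree: the penalized target is more conservative.
disagreeing = [10.0, 12.0]
plain = td3_target(1.0, 0.0, disagreeing)
penalized = std_penalized_target(1.0, 0.0, disagreeing)

# Two critics that agree: the penalty vanishes and both targets coincide.
agreeing = [10.0, 10.0]
same = std_penalized_target(1.0, 0.0, agreeing)
```

Under these assumptions the penalized target is never larger than the TD3 target, and the gap grows with the critics' disagreement, which is the sense in which such a target is "conservative".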

