LG, AI | Feb 3, 2024

SQT -- std $Q$-target

Nitsan Soffair, Dotan Di-Castro, Orly Avner, Shie Mannor

arXiv:2402.05950v3 | 2.6 | h-index: 16

Originality: Incremental advance

AI Analysis

This paper addresses overestimation bias in RL and is aimed at practitioners; it offers an incremental improvement over existing methods.

The paper tackles overestimation bias in reinforcement learning by introducing SQT, a conservative actor-critic algorithm that uses the standard deviation of its Q-networks as an uncertainty penalty. The authors report that SQT outperforms DDPG, TD3, and TD7 across seven MuJoCo and Bullet tasks.

Std $Q$-target (SQT) is a conservative, ensemble-based, actor-critic $Q$-learning algorithm built on a single key $Q$-formula: the standard deviation of the $Q$-networks, used as an "uncertainty penalty", which serves as a minimalistic solution to the problem of overestimation bias. We implement SQT on top of the TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms DDPG, TD3, and TD7 on seven popular MuJoCo and Bullet tasks. Our results demonstrate the superiority of SQT's $Q$-target formula over TD3's $Q$-target formula as a conservative solution to overestimation bias in RL, and SQT shows a clear performance advantage, by a wide margin, over DDPG, TD3, and TD7 on all tasks.
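To make the idea of an "uncertainty penalty" concrete, here is a minimal NumPy sketch, not the authors' implementation. It contrasts TD3's clipped double-$Q$ target with a target that also subtracts the standard deviation of the $Q$-networks' estimates. The function names, the coefficient `alpha`, and the choice to apply the penalty inside the discounted term are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def td3_target(reward, not_done, q_next, gamma=0.99):
    """TD3-style target: reward plus discounted minimum over the critics.

    q_next has shape (n_critics, batch) and holds each target critic's
    estimate of Q(s', a').
    """
    return reward + gamma * not_done * q_next.min(axis=0)


def sqt_style_target(reward, not_done, q_next, gamma=0.99, alpha=0.5):
    """Illustrative conservative target with a std-based penalty.

    ASSUMPTION: the penalty is alpha * std over the critics, subtracted
    from the minimum before discounting. alpha is a hypothetical knob.
    """
    penalty = alpha * q_next.std(axis=0)  # disagreement between critics
    return reward + gamma * not_done * (q_next.min(axis=0) - penalty)


# Two critics, batch of two transitions.
q_next = np.array([[1.0, 2.0],
                   [3.0, 2.0]])
reward = np.array([0.0, 1.0])
not_done = np.array([1.0, 1.0])

td3 = td3_target(reward, not_done, q_next)
sqt = sqt_style_target(reward, not_done, q_next)
```

In the first transition the critics disagree (1.0 vs. 3.0), so the penalized target falls below the TD3 target. In the second they agree, the standard deviation is zero, and the two targets coincide.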
