MinMaxMin $Q$-learning
This paper addresses a specific problem in reinforcement learning for continuous control tasks, offering incremental improvements over existing algorithms.
The paper tackles overestimation bias in conservative reinforcement learning algorithms by introducing MinMaxMin Q-learning, an optimistic Actor-Critic method that uses disagreement among Q-networks to adjust Q-targets and sampling rules, resulting in consistent performance improvements over DDPG, TD3, and TD7 across MuJoCo and Bullet environments.
MinMaxMin $Q$-learning is a novel optimistic Actor-Critic algorithm that addresses the problem of overestimation bias (the $Q$-estimates exceed the true $Q$-values) inherent in conservative RL algorithms. Its core formula relies on the disagreement among $Q$-networks, measured as the min-batch MaxMin $Q$-networks distance, which is added to the $Q$-target and used as the prioritized experience replay sampling rule. We implement MinMaxMin on top of TD3 and TD7 and test it against state-of-the-art continuous-space algorithms (DDPG, TD3, and TD7) across popular MuJoCo and Bullet environments. The results show a consistent performance improvement of MinMaxMin over DDPG, TD3, and TD7 across all tested tasks.
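
To make the disagreement term concrete, the following is a minimal sketch of one possible reading of the description above, not the authors' implementation. It assumes that the "MaxMin distance" of a sample is the gap between the largest and smallest $Q$-network estimates, that "min-batch" means the minimum of that gap over the batch, that the resulting scalar is added to a TD3-style target built from the smallest next-state estimate, and that per-sample gaps serve as replay priorities. All function names, the `gamma` default, and the `eps` smoothing constant are hypothetical.

```python
def maxmin_gap(q_values):
    """Per-sample disagreement: max minus min over the Q-networks.

    q_values[i][j] is the estimate of network i for batch sample j.
    """
    return [max(col) - min(col) for col in zip(*q_values)]


def minmaxmin_bonus(q_values):
    """Assumed 'min-batch MaxMin distance': smallest gap in the batch."""
    return min(maxmin_gap(q_values))


def optimistic_target(rewards, dones, next_q_values, q_values, gamma=0.99):
    """Conservative TD target (min over networks) plus the assumed bonus."""
    bonus = minmaxmin_bonus(q_values)
    # Smallest next-state estimate per sample, as in clipped double Q-learning.
    conservative = [min(col) for col in zip(*next_q_values)]
    return [
        r + gamma * (1.0 - d) * q_next + bonus
        for r, d, q_next in zip(rewards, dones, conservative)
    ]


def sampling_probabilities(q_values, eps=1e-6):
    """Assumed sampling rule: probability proportional to per-sample gap."""
    priorities = [gap + eps for gap in maxmin_gap(q_values)]
    total = sum(priorities)
    return [p / total for p in priorities]
```

Under these assumptions the bonus is never negative, so the target is never lower than the conservative one, and transitions on which the networks disagree most are replayed most often.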