Bellman Error Centering
The paper addresses stability issues in reinforcement learning algorithms, but its contribution is incremental, since it builds on existing reward centering methods.
The paper identifies that value-based reward centering (VRC) is actually Bellman error centering (BEC), establishes centered fixpoints for tabular and linear value functions, and proposes CTD and CTDC algorithms with proven convergence and experimental validation of stability.
This paper revisits the recently proposed reward centering algorithms, including simple reward centering (SRC) and value-based reward centering (VRC), and points out that SRC is indeed reward centering, while VRC is essentially Bellman error centering (BEC). Based on BEC, we provide the centered fixpoint for tabular value functions, as well as the centered TD fixpoint for linear value function approximation. We design the on-policy CTD algorithm and the off-policy CTDC algorithm, and prove the convergence of both algorithms. Finally, we experimentally validate the stability of our proposed algorithms. Bellman error centering also facilitates extension to a variety of reinforcement learning algorithms.
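
To make the SRC/VRC distinction concrete, the following is a minimal sketch, not the paper's CTD or CTDC algorithm. It assumes the update rules as they are usually stated in the reward centering literature: both variants subtract a scalar offset from the reward inside the TD error, SRC moves the offset toward the observed reward, and VRC moves it along the TD error itself, which is why the latter can be read as centering the Bellman error. The two-state chain, its rewards, the step sizes, and the function name centered_td are all hypothetical choices made for illustration.

```python
def centered_td(mode, steps=20000, gamma=0.9, alpha=0.1, beta=0.01):
    """Tabular TD(0) with a scalar centering offset on a toy 2-state chain.

    mode="src": the offset tracks a running mean of the rewards.
    mode="vrc": the offset is updated with the TD (Bellman) error.
    Everything here is an illustrative assumption, not the paper's setup.
    """
    rewards = [1.0, 3.0]   # hypothetical reward received on leaving each state
    v = [0.0, 0.0]         # tabular value estimates
    offset = 0.0           # centering term subtracted from the reward
    s = 0
    for _ in range(steps):
        r = rewards[s]
        s_next = 1 - s     # deterministic alternation 0 -> 1 -> 0 -> ...
        # Centered TD error: the offset is subtracted from the reward.
        delta = r - offset + gamma * v[s_next] - v[s]
        v[s] += alpha * delta
        if mode == "src":
            offset += beta * (r - offset)   # center on the reward
        elif mode == "vrc":
            offset += beta * delta          # center on the Bellman error
        else:
            raise ValueError("mode must be 'src' or 'vrc'")
        s = s_next
    return v, offset


if __name__ == "__main__":
    for mode in ("src", "vrc"):
        values, offset = centered_td(mode)
        print(mode, values, offset)
```

In this toy setting both modes drive the TD error to zero, so they agree on the gap between the two state values. Only the SRC offset is tied directly to the average reward (about 2.0 here); the VRC offset is determined jointly with the value estimates.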