Toward Efficient Gradient-Based Value Estimation
This work addresses the slow convergence of gradient-based value estimation, a practical efficiency concern in reinforcement learning, though the contribution is incremental in that it builds on existing gradient-based methods.
The paper tackles the slowness of gradient-based value estimation in reinforcement learning. It shows that the Mean Square Bellman Error loss is ill-conditioned and proposes the RANS algorithm, which is significantly faster than residual gradient methods and competitive with Temporal Difference learning on the classic problems tested.
Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that the Mean Square Bellman Error (MSBE) is an ill-conditioned loss function in the sense that its Hessian has a large condition number. To resolve the adverse effect of the poor conditioning of MSBE on gradient-based methods, we propose a low-complexity, batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than residual gradient methods while having almost the same computational complexity, and it is competitive with TD on the classic problems that we tested.
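
The ill-conditioning claim can be illustrated numerically with a minimal sketch. It is not the RANS algorithm and not an experiment from the paper. It assumes a tabular value function, a symmetric random walk on a ring of 10 states, and uniform state weighting; the helper names ring_random_walk, msbe_hessian, and condition_number are hypothetical. Under these assumptions the MSBE is (1/n) * ||r - (I - gamma P) v||^2, so its Hessian is (2/n) * (I - gamma P)^T (I - gamma P), and the script prints how the condition number of that Hessian grows as the discount factor gamma approaches 1.

```python
import numpy as np


def ring_random_walk(n):
    """Transition matrix of a symmetric random walk on a ring of n states.

    Illustrative toy chain (an assumption, not taken from the paper).
    """
    P = np.zeros((n, n))
    for s in range(n):
        P[s, (s - 1) % n] += 0.5  # step to the left neighbour
        P[s, (s + 1) % n] += 0.5  # step to the right neighbour
    return P


def msbe_hessian(P, gamma):
    """Hessian of MSBE(v) = (1/n) * ||r - (I - gamma P) v||^2 for tabular v.

    The loss is quadratic in v, so the Hessian does not depend on v or r.
    """
    n = P.shape[0]
    A = np.eye(n) - gamma * P  # Bellman residual is r - A v
    return 2.0 * A.T @ A / n


def condition_number(H):
    """Ratio of largest to smallest eigenvalue of a symmetric positive definite matrix."""
    eig = np.linalg.eigvalsh(H)  # eigenvalues in ascending order
    return eig[-1] / eig[0]


if __name__ == "__main__":
    P = ring_random_walk(10)
    for gamma in (0.5, 0.9, 0.99):
        kappa = condition_number(msbe_hessian(P, gamma))
        print(f"gamma={gamma}: condition number = {kappa:.1f}")
```

For this particular chain with an even number of states, the eigenvalues of P range from -1 to 1, so the condition number equals ((1 + gamma) / (1 - gamma))^2 and grows quadratically in 1 / (1 - gamma). Plain gradient descent needs a number of iterations roughly proportional to the condition number, which is one way to see why residual gradient methods are slow at high discount factors and why a Gauss-Newton-like direction can help.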