LG AI OC MLMay 24, 2019

Neural Temporal-Difference and Q-Learning Provably Converge to Global Optima

Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang

arXiv:1905.10027v241 citations

Originality Highly original

AI Analysis

This addresses a fundamental convergence problem in deep reinforcement learning, providing theoretical guarantees for widely used algorithms.

The paper proves for the first time that neural temporal-difference (TD) learning converges to the global optimum for policy evaluation, achieving a sublinear rate, and extends this result to neural Q-learning and policy gradient algorithms.

Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD. Beyond policy evaluation, we establish the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.

View on arXiv PDF

Similar