ML · LG · Mar 9, 2024

Statistical Efficiency of Distributional Temporal Difference Learning and Freedman's Inequality in Hilbert Spaces

arXiv:2403.05811v49.25 citations · h-index: 4 · NIPS

Originality: Highly original

AI Analysis

This work provides theoretical guarantees for distributional reinforcement learning. The contribution is incremental in the sense that it extends classic policy evaluation results to the distributional setting.

The paper tackles distributional policy evaluation in reinforcement learning by analyzing the non-asymptotic statistical rates of distributional temporal difference learning. It obtains a sample complexity bound of $\tilde{O}(\varepsilon^{-2}\mu_{\min}^{-1}(1-\gamma)^{-3})$, minimax optimal up to logarithmic factors, for reaching an $\varepsilon$-optimal estimator with high probability in the $1$-Wasserstein distance.
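To get a feel for how the bound scales, the leading term can be evaluated numerically. The sketch below is only illustrative: the function name `sample_complexity_scale` is hypothetical, constants and logarithmic factors hidden by $\tilde{O}$ are dropped, and $\mu_{\min}$ is treated simply as a number in $(0, 1]$ because this page does not define it.

```python
def sample_complexity_scale(eps: float, mu_min: float, gamma: float) -> float:
    """Leading-order scaling eps^-2 * mu_min^-1 * (1 - gamma)^-3.

    Hypothetical helper: it ignores the constants and log factors hidden
    by the O-tilde notation, so only ratios between values are meaningful.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    if not 0 < mu_min <= 1:
        raise ValueError("mu_min is assumed to lie in (0, 1]")
    if not 0 <= gamma < 1:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (eps ** 2 * mu_min * (1.0 - gamma) ** 3)


# Halving eps multiplies the scale by 4; moving gamma from 0.9 to 0.95
# halves the effective horizon gap (1 - gamma) and multiplies it by 8.
ratio_eps = sample_complexity_scale(0.05, 0.5, 0.9) / sample_complexity_scale(0.1, 0.5, 0.9)
ratio_gamma = sample_complexity_scale(0.1, 0.5, 0.95) / sample_complexity_scale(0.1, 0.5, 0.9)
```

The cubic dependence on $(1-\gamma)^{-1}$ dominates in long-horizon problems, which is why the comparison with classic policy evaluation rates focuses on that exponent.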

Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One core task in DRL is distributional policy evaluation, which involves estimating the return distribution $η^π$ for a given policy $π$. Distributional temporal difference learning, which extends classic temporal difference learning (TD) in RL, has been proposed for this task. In this paper, we focus on the non-asymptotic statistical rates of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD (NTD). For a $γ$-discounted infinite-horizon tabular Markov decision process, we show that NTD with a generative model needs $\tilde{O}(\varepsilon^{-2}μ_{\min}^{-1}(1-γ)^{-3})$ interactions with the environment to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $1$-Wasserstein distance. This sample complexity bound is minimax optimal up to logarithmic factors. In addition, we revisit categorical distributional TD (CTD), showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $1$-Wasserstein distance. We also extend our analysis to the more general setting where the data generating process is Markovian. In the Markovian setting, we propose variance-reduced variants of NTD and CTD, and show that both can achieve a $\tilde{O}(\varepsilon^{-2} μ_{π,\min}^{-1}(1-γ)^{-3}+t_{\mathrm{mix}}μ_{π,\min}^{-1}(1-γ)^{-1})$ sample complexity bound in the case of the $1$-Wasserstein distance, which matches the state-of-the-art statistical results for classic policy evaluation. To achieve the sharp statistical rates, we establish a novel Freedman's inequality in Hilbert spaces. This new Freedman's inequality would be of independent interest for statistical analysis of various infinite-dimensional online learning problems.
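For readers unfamiliar with categorical distributional TD, the following is a minimal sketch of one synchronous update with a generative model. It is not taken from the paper: it uses the standard categorical projection from the distributional RL literature on an evenly spaced support, and the names `project_categorical`, `ctd_sync_step`, and `sample_fn`, the constant step size `alpha`, and the support bounds `v_min` and `v_max` are all assumptions made for illustration.

```python
import math


def project_categorical(probs, reward, gamma, v_min, v_max):
    """Push atoms through x -> reward + gamma * x, then project the result
    back onto the evenly spaced support between v_min and v_max."""
    k = len(probs)
    delta = (v_max - v_min) / (k - 1)
    out = [0.0] * k
    for j, p in enumerate(probs):
        # Shifted and scaled atom, clipped to the support range.
        target = min(max(reward + gamma * (v_min + j * delta), v_min), v_max)
        pos = (target - v_min) / delta
        lo = min(int(math.floor(pos)), k - 1)
        hi = min(lo + 1, k - 1)
        w_hi = pos - lo
        # Split the mass linearly between the two neighbouring atoms.
        out[lo] += p * (1.0 - w_hi)
        out[hi] += p * w_hi
    return out


def ctd_sync_step(eta, sample_fn, gamma, alpha, v_min, v_max):
    """One synchronous update: for every state s, draw (reward, next_state)
    from the generative model sample_fn(s) and mix the projected target
    into the current estimate with step size alpha."""
    new_eta = []
    for s, probs in enumerate(eta):
        reward, s_next = sample_fn(s)
        target = project_categorical(eta[s_next], reward, gamma, v_min, v_max)
        new_eta.append([(1.0 - alpha) * p + alpha * t for p, t in zip(probs, target)])
    return new_eta
```

Here `eta[s]` is a list of probabilities over the fixed atoms for state `s`. With rewards in $[0, 1]$, a natural choice under these assumptions is `v_min = 0` and `v_max = 1 / (1 - gamma)`. The paper's variance-reduced variants for the Markovian setting are not covered by this sketch.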
