LG · AI · ML · Jun 25, 2019

Expected Sarsa($λ$) with Control Variate for Variance Reduction

arXiv:1906.11058v2
Originality: Incremental advance
AI Analysis

This paper addresses variance reduction for off-policy learning in reinforcement learning, which is critical for stability. The advance is incremental because it builds on existing Expected Sarsa methods.

The paper tackles the high-variance problem in off-policy reinforcement learning by applying a control variate technique to Expected Sarsa(λ), proposing ES(λ)-CV for the tabular setting and GES(λ) for linear function approximation. It proves that ES(λ)-CV has lower variance than Expected Sarsa(λ) when a suitable value-function estimator is available, and that GES(λ) achieves an O(1/T) convergence rate. In numerical experiments, the proposed algorithm outperforms state-of-the-art algorithms such as GQ(λ).

Off-policy learning is powerful for reinforcement learning. However, the high variance of off-policy evaluation is a critical challenge, which causes off-policy learning to fall into uncontrolled instability. In this paper, to reduce the variance, we introduce the control variate technique to $\mathtt{Expected}$ $\mathtt{Sarsa}$($λ$) and propose a tabular $\mathtt{ES}$($λ$)-$\mathtt{CV}$ algorithm. We prove that if a proper estimator of the value function is attained, the proposed $\mathtt{ES}$($λ$)-$\mathtt{CV}$ enjoys a lower variance than $\mathtt{Expected}$ $\mathtt{Sarsa}$($λ$). Furthermore, to extend $\mathtt{ES}$($λ$)-$\mathtt{CV}$ to a convergent algorithm with linear function approximation, we propose the $\mathtt{GES}$($λ$) algorithm under the convex-concave saddle-point formulation. We prove that the convergence rate of $\mathtt{GES}$($λ$) achieves $\mathcal{O}(1/T)$, which matches or outperforms many state-of-the-art gradient-based algorithms while requiring a more relaxed condition. Numerical experiments show that the proposed algorithm performs better, with lower variance, than several state-of-the-art gradient-based TD learning algorithms: $\mathtt{GQ}$($λ$), $\mathtt{GTB}$($λ$) and $\mathtt{ABQ}$($ζ$).
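
To make the control variate idea concrete, here is a minimal Python sketch. It is not the paper's ES(λ)-CV algorithm; it only illustrates the general per-decision control variate correction on a single state with two actions. The policies `pi` and `mu`, the values `q_true` and `q_hat`, and the function name `compare_estimators` are all hypothetical choices made for this illustration. Both estimators target the expected value under the target policy, but the corrected one subtracts an approximate action value before importance weighting and adds back its expectation under the target policy.

```python
import numpy as np


def compare_estimators(n_samples=200_000, seed=0):
    """Compare plain importance sampling with a control-variate version.

    All numbers below are hypothetical and chosen only for illustration.
    """
    rng = np.random.default_rng(seed)

    pi = np.array([0.9, 0.1])      # target policy over two actions (assumed)
    mu = np.array([0.5, 0.5])      # behaviour policy that generates data (assumed)
    q_true = np.array([1.0, 3.0])  # true action values (assumed)
    q_hat = np.array([1.1, 2.8])   # imperfect value estimates used as control variate

    # Sample actions from the behaviour policy and observe noisy returns.
    actions = rng.choice(2, size=n_samples, p=mu)
    returns = q_true[actions] + rng.normal(0.0, 0.5, size=n_samples)

    # Importance sampling ratio pi(a) / mu(a) for each sampled action.
    rho = pi[actions] / mu[actions]

    # Plain importance-sampled estimate of sum_a pi(a) * q_true(a).
    plain = rho * returns

    # Control variate: weight only the residual (return - q_hat), then add
    # the exact expectation of q_hat under pi. This stays unbiased because
    # E_mu[rho * q_hat(A)] equals sum_a pi(a) * q_hat(a).
    with_cv = rho * (returns - q_hat[actions]) + pi @ q_hat

    return {
        "target": float(pi @ q_true),
        "plain_mean": float(plain.mean()),
        "plain_var": float(plain.var()),
        "cv_mean": float(with_cv.mean()),
        "cv_var": float(with_cv.var()),
    }


if __name__ == "__main__":
    for name, value in compare_estimators().items():
        print(f"{name}: {value:.4f}")
```

In this toy setting both sample means should land close to the target value, while the corrected estimator has the smaller sample variance because `q_hat` is close to `q_true`. This mirrors the abstract's condition that the variance reduction depends on having a proper estimator of the value function; a poor `q_hat` would shrink or remove the benefit.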

