LG · AI · ML · May 17, 2019

TBQ($σ$): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning

arXiv:1905.07237v1 · 7 citations
Originality: Incremental advance
AI Analysis

This addresses a specific bottleneck in off-policy RL for control problems, offering an incremental improvement over existing methods.

The paper tackles the inefficiency of off-policy reinforcement learning with eligibility traces under greedy target policies by introducing TBQ(σ), which unifies tree-backup and Naive Q(λ) methods. It shows that for ε-greedy policies, this approach accelerates learning and improves performance by optimizing trace utilization.

Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between the target policy and the behavior policy. One common approach is to measure the difference between the two policies probabilistically, as in importance sampling and tree-backup. However, existing off-policy methods based on probabilistic policy measurement utilize traces inefficiently under a greedy target policy, which makes them ineffective for control problems: the traces are cut immediately whenever a non-greedy action is taken, which may forfeit the advantage of eligibility traces and slow down learning. Alternatively, some non-probabilistic measurement methods such as General Q($λ$) and Naive Q($λ$) never cut traces, but face convergence problems in practice. To address these issues, this paper introduces a new method named TBQ($σ$), which unifies the tree-backup algorithm and Naive Q($λ$). By introducing a new parameter $σ$ that represents the degree of trace utilization, TBQ($σ$) integrates TB($λ$) and Naive Q($λ$) and allows a continuous shift between them. The contraction property of TBQ($σ$) is theoretically analyzed for both policy evaluation and control settings. We also derive the online version of TBQ($σ$) and give its convergence proof. We empirically show that, for $ε\in(0,1]$ in $ε$-greedy policies, there exists some degree of utilizing traces, for $λ\in[0,1]$, that improves the efficiency of trace utilization in off-policy reinforcement learning, both accelerating learning and improving performance.
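
The trace-cutting behavior described in the abstract can be illustrated with a small Python sketch. The abstract does not state the TBQ($σ$) update rule, so the linear blend used in `trace_coefficient` is an assumption made only for illustration: it reduces to the tree-backup coefficient $λπ(a|s)$ at $σ=0$ and to the Naive Q($λ$) coefficient $λ$ at $σ=1$. All function names, the toy action values, and the single shared `q_row` are hypothetical.

```python
def greedy_prob(q_row, action):
    """pi(action | state) under a greedy target policy; ties are split evenly."""
    best = max(q_row)
    greedy = [a for a, q in enumerate(q_row) if q == best]
    return 1.0 / len(greedy) if action in greedy else 0.0


def trace_coefficient(pi_a, lam, sigma):
    """ASSUMED linear blend, not taken from the paper:
    sigma = 0 gives the tree-backup coefficient lam * pi_a,
    sigma = 1 gives the Naive Q(lambda) coefficient lam."""
    return lam * ((1.0 - sigma) * pi_a + sigma)


def trace_after(actions, q_row, gamma, lam, sigma):
    """Trace weight remaining on the starting state-action pair
    after the behavior policy has taken `actions` in sequence.
    Toy simplification: every state shares the same action values."""
    weight = 1.0
    for a in actions:
        weight *= gamma * trace_coefficient(greedy_prob(q_row, a), lam, sigma)
    return weight


q_row = [0.2, 1.0, 0.5]   # hypothetical action values; action 1 is greedy
actions = [1, 1, 0, 1]    # the third action is exploratory (non-greedy)

for sigma in (0.0, 0.5, 1.0):
    print(sigma, trace_after(actions, q_row, gamma=0.99, lam=0.9, sigma=sigma))
```

Under this assumed blend, $σ=0$ drops the trace to zero at the exploratory action, $σ=1$ never cuts it, and intermediate values only attenuate it, which matches the "continuous shift" between TB($λ$) and Naive Q($λ$) that the abstract describes.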

