Adaptive Tree Backup Algorithms for Temporal-Difference Reinforcement Learning
This work corrects a common misconception about the role of the interpolation parameter σ in temporal-difference reinforcement learning and offers a new perspective on how to set it, although the contribution is an incremental refinement of existing methods.
The paper disproves the common belief that the interpolation parameter σ in Q(σ) acts as a bias-variance trade-off, showing that σ=0 minimizes variance without increasing bias, and hypothesizes a new trade-off where larger σ-values help overcome poor initializations at the expense of higher variance. It proposes Adaptive Tree Backup (ATB) methods that adjust backups based on experience, with experiments showing these adaptive strategies outperform fixed or time-annealed σ-values.
Q($σ$) is a recently proposed temporal-difference learning method that interpolates between learning from expected backups and sampled backups. It has been shown that intermediate values for the interpolation parameter $σ\in [0,1]$ perform better in practice, and therefore it is commonly believed that $σ$ functions as a bias-variance trade-off parameter to achieve these improvements. In our work, we disprove this notion, showing that the choice of $σ=0$ minimizes variance without increasing bias. This indicates that $σ$ must have some other effect on learning that is not fully understood. As an alternative, we hypothesize the existence of a new trade-off: larger $σ$-values help overcome poor initializations of the value function, at the expense of higher statistical variance. To automatically balance these considerations, we propose Adaptive Tree Backup (ATB) methods, whose weighted backups evolve as the agent gains experience. Our experiments demonstrate that adaptive strategies can be more effective than relying on fixed or time-annealed $σ$-values.
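To make the interpolation concrete, the following is a minimal sketch of the standard one-step Q($σ$) backup target, together with an exact computation of its mean and variance over the next action. It is an illustration only, not the authors' implementation, and it does not show the ATB methods. It assumes a fixed value estimate for the next state and a next action drawn from the target policy. The function names `q_sigma_target` and `target_mean_and_variance` and the toy numbers are hypothetical.

```python
from typing import Sequence, Tuple


def q_sigma_target(reward: float, gamma: float, sigma: float,
                   q_next: Sequence[float], pi_next: Sequence[float],
                   next_action: int) -> float:
    """One-step Q(sigma) target for a fixed next-state value estimate."""
    # Expected backup: next-state action values averaged under the target policy.
    expected = sum(p * q for p, q in zip(pi_next, q_next))
    # Sampled backup: value of the single next action that was drawn.
    sampled = q_next[next_action]
    # sigma = 1 gives the fully sampled target, sigma = 0 the fully expected one.
    return reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)


def target_mean_and_variance(reward: float, gamma: float, sigma: float,
                             q_next: Sequence[float],
                             pi_next: Sequence[float]) -> Tuple[float, float]:
    """Exact mean and variance of the target when next_action ~ pi_next."""
    targets = [
        q_sigma_target(reward, gamma, sigma, q_next, pi_next, a)
        for a in range(len(q_next))
    ]
    mean = sum(p * t for p, t in zip(pi_next, targets))
    variance = sum(p * (t - mean) ** 2 for p, t in zip(pi_next, targets))
    return mean, variance


if __name__ == "__main__":
    # Toy next state with two actions (assumed values, for illustration only).
    q_next = [1.0, 3.0]
    pi_next = [0.5, 0.5]
    for sigma in (0.0, 0.5, 1.0):
        mean, var = target_mean_and_variance(0.0, 1.0, sigma, q_next, pi_next)
        print(f"sigma={sigma}: mean={mean}, variance={var}")
```

In this toy setting the mean of the target is 2.0 for every $σ$, while its variance grows with $σ^2$ (0, 0.25, and 1 for $σ = 0$, $0.5$, and $1$). This agrees with the claim that $σ=0$ minimizes variance without increasing bias, though only for the restricted one-step case sketched here.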