Adaptive Lambda Least-Squares Temporal Difference Learning
The work addresses a practical bottleneck for reinforcement learning practitioners by automating the selection of the λ parameter, though the contribution is incremental because it builds on existing LSTD methods.
The paper tackles the problem of automatically tuning the λ parameter in Temporal Difference learning. It formalizes the choice of λ as a bias-variance trade-off and proposes ALLSTD, an efficient algorithm that combines Leave-One-Trajectory-Out Cross-Validation with function optimization. ALLSTD achieves performance similar to the naive LOTO-CV implementation while being significantly faster to compute.
Temporal Difference learning, or TD($λ$), is a fundamental algorithm in the field of reinforcement learning. However, setting TD's $λ$ parameter, which controls the timescale of TD updates, is generally left up to the practitioner. We formalize the $λ$ selection problem as a bias-variance trade-off where the solution is the value of $λ$ that leads to the smallest Mean Squared Value Error (MSVE). To solve this trade-off, we suggest applying Leave-One-Trajectory-Out Cross-Validation (LOTO-CV) to search the space of $λ$ values. Unfortunately, a naïve implementation of this approach is too computationally expensive for most practical applications. For Least Squares TD (LSTD), we show that LOTO-CV can be implemented efficiently to automatically tune $λ$, and we apply function optimization methods to efficiently search the space of $λ$ values. The resulting algorithm, ALLSTD, is parameter-free, and our experiments demonstrate that ALLSTD is significantly faster than the naïve LOTO-CV implementation while achieving similar performance.
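To make the LOTO-CV idea concrete, the following is a minimal sketch of the naive search over a fixed grid of $λ$ values, not the ALLSTD algorithm itself. Several details are assumptions that the text above does not specify: each trajectory is a pair of a feature matrix and a reward vector, the terminal state has zero features, the held-out error is measured against the Monte Carlo returns of the left-out trajectory, and a small ridge term keeps the linear solve well posed. The function names `lstd_stats`, `mc_returns`, and `loto_cv_lambda` are hypothetical.

```python
import numpy as np


def lstd_stats(features, rewards, gamma, lam):
    """Accumulate the LSTD(lambda) statistics A and b for one trajectory."""
    n, d = features.shape
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = np.zeros(d)  # eligibility trace
    for t in range(n):
        phi = features[t]
        # Assumption: the state after the last step is terminal (zero features).
        phi_next = features[t + 1] if t + 1 < n else np.zeros(d)
        z = gamma * lam * z + phi
        A += np.outer(z, phi - gamma * phi_next)
        b += z * rewards[t]
    return A, b


def mc_returns(rewards, gamma):
    """Discounted Monte Carlo return from every time step of one trajectory."""
    returns = np.zeros(len(rewards))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = rewards[t] + gamma * acc
        returns[t] = acc
    return returns


def loto_cv_lambda(trajectories, gamma, lambdas, ridge=1e-6):
    """Pick the lambda with the smallest leave-one-trajectory-out error.

    trajectories: list of (features, rewards) pairs, features of shape (T, d).
    Returns the selected lambda and the list of per-lambda CV errors.
    """
    d = trajectories[0][0].shape[1]
    scores = []
    for lam in lambdas:
        stats = [lstd_stats(f, r, gamma, lam) for f, r in trajectories]
        A_all = sum(s[0] for s in stats)
        b_all = sum(s[1] for s in stats)
        sq_err, count = 0.0, 0
        for (f, r), (A_i, b_i) in zip(trajectories, stats):
            # Fit on all trajectories except the held-out one.
            theta = np.linalg.solve(A_all - A_i + ridge * np.eye(d), b_all - b_i)
            # Assumption: score against the held-out Monte Carlo returns.
            sq_err += np.sum((f @ theta - mc_returns(r, gamma)) ** 2)
            count += len(r)
        scores.append(sq_err / count)
    best = int(np.argmin(scores))
    return lambdas[best], scores
```

Even this sketch performs one linear solve per trajectory per candidate $λ$, which illustrates why the naive search becomes expensive as the number of trajectories and grid points grows.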