ML LG

Oct 7, 2025

Implicit Updates for Average-Reward Temporal Difference Learning

Hwanwoo Kim, Dongkyu Derek Cho, Eric Laber

arXiv:2510.06149v1

10.32 citations

h-index: 1

Originality: Incremental advance

AI Analysis

The paper addresses a numerical stability problem in average-reward reinforcement learning, covering both policy evaluation and policy learning, and offers a robust alternative to standard average-reward TD(λ). The advance is incremental because it builds on prior TD work.

The paper tackles the sensitivity of standard average-reward TD(λ) to step-size tuning by introducing an implicit variant that provides data-adaptive stabilization. The result is improved numerical stability and reliable operation over a broader range of step-sizes.

Temporal difference (TD) learning is a cornerstone of reinforcement learning. In the average-reward setting, standard TD($λ$) is highly sensitive to the choice of step-size and thus requires careful tuning to maintain numerical stability. We introduce average-reward implicit TD($λ$), which employs an implicit fixed-point update to provide data-adaptive stabilization while preserving the per-iteration computational complexity of standard average-reward TD($λ$). In contrast to prior finite-time analyses of average-reward TD($λ$), which impose restrictive step-size conditions, we establish finite-time error bounds for the implicit variant under substantially weaker step-size requirements. Empirically, average-reward implicit TD($λ$) operates reliably over a much broader range of step-sizes and exhibits markedly improved numerical stability. This enables more efficient policy evaluation and policy learning, highlighting its effectiveness as a robust alternative to average-reward TD($λ$).
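
The abstract does not spell out the update rule, so the following is only a minimal sketch of how an implicit fixed-point update can give data-adaptive stabilization. It is not the authors' algorithm. It assumes linear value features, the special case λ = 0, and an average-reward estimate `omega` that is held fixed during the step. All function and variable names are hypothetical. Under these assumptions, evaluating the current-state value at the new weights and solving the resulting fixed-point equation in closed form rescales the step-size `alpha` to `alpha / (1 + alpha * ||phi||^2)`, which costs the same order of work per iteration as the explicit update.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def td_error(theta, omega, phi, phi_next, reward):
    """Average-reward TD error with linear values v(s) = phi(s)^T theta."""
    return reward - omega + dot(phi_next, theta) - dot(phi, theta)


def explicit_step(theta, omega, phi, phi_next, reward, alpha):
    """Standard (explicit) update for lambda = 0: theta + alpha * delta * phi."""
    delta = td_error(theta, omega, phi, phi_next, reward)
    return [t + alpha * delta * p for t, p in zip(theta, phi)]


def implicit_step(theta, omega, phi, phi_next, reward, alpha):
    """Illustrative implicit update for lambda = 0 (an assumption, not the paper's rule).

    Solves theta_new = theta + alpha * (reward - omega
                                        + phi_next^T theta - phi^T theta_new) * phi,
    whose closed-form solution shrinks alpha by 1 / (1 + alpha * ||phi||^2).
    """
    delta = td_error(theta, omega, phi, phi_next, reward)
    shrunk_alpha = alpha / (1.0 + alpha * dot(phi, phi))
    return [t + shrunk_alpha * delta * p for t, p in zip(theta, phi)]


if __name__ == "__main__":
    # Toy values chosen only for illustration, with a deliberately large step-size.
    theta, omega = [0.5, -1.0], 0.2
    phi, phi_next, reward, alpha = [1.0, 2.0], [0.0, 1.0], 1.0, 10.0
    print("explicit:", explicit_step(theta, omega, phi, phi_next, reward, alpha))
    print("implicit:", implicit_step(theta, omega, phi, phi_next, reward, alpha))
```

In this sketch the effective step-size can never exceed `1 / ||phi||^2`, however large `alpha` is. This is one way to read "data-adaptive stabilization": the damping depends on the observed features rather than on manual tuning.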
