LGMLFeb 15, 2020

Loop Estimator for Discounted Values in Markov Reward Processes

arXiv:2002.06299v31 citations
AI Analysis

This work addresses efficiency in policy evaluation for reinforcement learning, offering a method with lower space complexity than existing approaches, though it appears incremental as it builds on regenerative structures without a paradigm shift.

The authors tackled the problem of estimating discounted state values in Markov reward processes by proposing a loop estimator that exploits regenerative structure, achieving O(1) space complexity for a single state and an instance-dependent convergence rate of ~O(√(τ_s/T)).

At the working heart of policy iteration algorithms commonly used and studied in the discounted setting of reinforcement learning, the policy evaluation step estimates the value of states with samples from a Markov reward process induced by following a Markov policy in a Markov decision process. We propose a simple and efficient estimator called loop estimator that exploits the regenerative structure of Markov reward processes without explicitly estimating a full model. Our method enjoys a space complexity of $O(1)$ when estimating the value of a single positive recurrent state $s$ unlike TD with $O(S)$ or model-based methods with $O\left(S^2\right)$. Moreover, the regenerative structure enables us to show, without relying on the generative model approach, that the estimator has an instance-dependent convergence rate of $\widetilde{O}\left(\sqrt{τ_s/T}\right)$ over steps $T$ on a single sample path, where $τ_s$ is the maximal expected hitting time to state $s$. In preliminary numerical experiments, the loop estimator outperforms model-free methods, such as TD(k), and is competitive with the model-based estimator.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes