Linking PageRank, Time Reversal, and Policy Evaluation
This work offers a new theoretical bridge between reinforcement learning and network analysis, potentially enabling the use of efficient PageRank algorithms for policy evaluation in MDPs.
The paper establishes a theoretical connection between policy evaluation in Markov decision processes and PageRank, showing that the value function can be obtained from PageRank vectors via time reversal. It provides a decomposition theorem for arbitrary finite MDPs and demonstrates efficiency on a numerical example with large graphs.
We establish a connection between policy evaluation in Markov decision processes and PageRank in network analysis. For a fixed policy, we show that the value function of a discounted Markov decision process can be obtained, up to an explicit rescaling, from the PageRank vector of a suitably defined time-reversed Markov chain. In this correspondence, the discount factor plays the role of the teleportation parameter, while rewards induce the restart distribution. Beyond the irreducible case, invoking quasi-stationary distributions and Doob $h$-transforms, we prove a general decomposition theorem showing that policy evaluation for arbitrary finite MDPs reduces to a collection of PageRank problems on the recurrent and transient components of the policy-induced Markov chain. This framework naturally extends to undiscounted MDPs with terminal states and to transition-dependent rewards. We conclude by showing efficiency of our approach on a numerical example of a sticky random walk on large deterministic and random graphs.