Persistent-Transient Policy Evaluation for Markov Chains via Minimal Peripheral Quotients
This work provides a theoretical foundation for disentangling persistent and transient dynamics in Markov chain evaluation, which is relevant for reinforcement learning and control in non-ergodic systems.
The paper addresses the ambiguity in policy evaluation for reducible and periodic Markov chains by identifying the peripheral invariant subspace as the source of mixing between persistent and transient effects. It proposes a minimal quotient decomposition that separates persistent regime profiles from transient components, enabling exact finite-horizon return reconstruction and stable estimation.
We study fixed-policy evaluation for finite Markov chains that may be reducible and periodic. The classical gain-bias decomposition is not always diagnostic in this setting: the gain records only invariant Cesàro averages, while persistent phase-dependent behavior is absorbed into the bias together with genuinely transient effects. We identify the real peripheral invariant subspace $\mathcal{K}(P)$ of the transition matrix $P$ as the source of this ambiguity. The quotient by $\mathcal{K}(P)$ is the minimal exact quotient that removes all non-decaying modes and makes the remaining dynamics strictly stable. After choosing a gauge projection $\Pi$ with kernel $\mathcal{K}(P)$, the reward admits a unique decomposition $r = g_\Pi^\star + (I-P)v_\Pi^\star$, where $g_\Pi^\star$ is a persistent regime profile and $v_\Pi^\star$ is a gauge-fixed transient component. An exact comparison with the classical normalized gain and bias shows that the new pair reallocates the same information, so that all persistent modes are represented in $g_\Pi^\star$ and $v_\Pi^\star$ is transient. This decomposition reconstructs finite-horizon returns, recovers the statewise average reward, admits a transient-cost interpretation, and yields a stable estimator under a generative model.
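As a numerical illustration of the decomposition, the following NumPy sketch computes one such pair on a small chain. It rests on assumptions not spelled out above: that the persistent profile lies in $\mathcal{K}(P)$, that the transient component lies in the range of the gauge projection, and that the gauge is the orthogonal projection onto $\mathcal{K}(P)^\perp$ (one admissible choice among many). The function names `peripheral_basis` and `decompose` and the three-state chain are hypothetical and serve only as an example.

```python
import numpy as np

def peripheral_basis(P, tol=1e-8):
    """Orthonormal bases (K, W) of the real peripheral subspace and of its
    orthogonal complement. Assumes the peripheral eigenvalues are semisimple,
    which holds for stochastic matrices."""
    w, V = np.linalg.eig(P)                    # right eigenvectors of P
    sel = np.abs(np.abs(w) - 1.0) < tol        # eigenvalues on the unit circle
    # Real span of the selected eigenvectors: stack real and imaginary parts.
    M = np.hstack([V[:, sel].real, V[:, sel].imag])
    U, s, _ = np.linalg.svd(M, full_matrices=True)
    k = int(np.sum(s > tol))                   # dimension of the subspace
    return U[:, :k], U[:, k:]

def decompose(P, r, tol=1e-8):
    """Solve r = g + (I - P) v with g in span(K) and v in span(W).
    The orthogonal gauge (v orthogonal to K) is an assumption of this sketch."""
    n = P.shape[0]
    K, W = peripheral_basis(P, tol)
    A = np.hstack([K, (np.eye(n) - P) @ W])    # square n-by-n system
    coef = np.linalg.solve(A, r)
    k = K.shape[1]
    return K @ coef[:k], W @ coef[k:]

# Hypothetical example: state 0 is transient and feeds a period-2 cycle {1, 2}.
P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
r = np.array([3.0, 2.0, 0.0])
g, v = decompose(P, r)

# Finite-horizon return computed directly and via the decomposition.
T = 7
direct = sum(np.linalg.matrix_power(P, t) @ r for t in range(T))
via_g = sum(np.linalg.matrix_power(P, t) @ g for t in range(T))
reconstructed = via_g + (np.eye(3) - np.linalg.matrix_power(P, T)) @ v
```

For this chain the sketch gives $g = (1, 2, 0)$: the phase-dependent rewards on the cycle stay in the persistent profile instead of being pushed into the bias, and the transient state inherits the cycle average. The two finite-horizon returns agree because $\sum_{t=0}^{T-1} P^t (I-P) v = (I - P^T) v$ telescopes, where $P^T$ here denotes the $T$-th power of $P$.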