LG MLMay 5

Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes

arXiv:2605.0392115.5

Predicted impact top 61% in LG · last 90 daysOriginality Highly original

AI Analysis

For researchers in reinforcement learning, this provides a computationally efficient and asymptotically optimal method for best policy identification in tabular MDPs, improving over prior algorithms like MOCA and PEDEL.

The paper tackles the problem of PAC policy identification in finite-horizon episodic tabular MDPs, proposing a randomized algorithm combining posterior sampling with online learning that achieves asymptotic optimality in sample complexity and runs in O(S^2AH) per episode, avoiding suboptimal log(1/δ) dependence.

We study the $(\varepsilon, δ)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon>0$) but suffer from high computational cost, rendering them hard to implement, and also suffer from suboptimal dependence on $\log(1/δ)$. We propose a randomized and computationally efficient algorithm for best policy identification that combines posterior sampling with an online learning algorithm to guide exploration in the MDP. Our method achieves asymptotic optimality in sample complexity, also in terms of posterior contraction rate, and runs in $O(S^2AH)$ per episode, matching standard model-based approaches. Unlike prior algorithms such as MOCA and PEDEL, our guarantees remain meaningful in the asymptotic regime and avoid sub-optimal polynomial dependence on $\log(1/δ)$. Our results provide both theoretical insights and practical tools for efficient policy identification in tabular MDPs.

View on arXiv PDF

Similar