MLLGSTMay 21, 2025

Policy Testing in Markov Decision Processes

arXiv:2505.15342v11 citationsh-index: 11
Originality Highly original
AI Analysis

This work addresses a fundamental verification challenge in reinforcement learning for researchers and practitioners, though it is incremental as it builds on existing pure exploration frameworks in MDPs.

The paper tackles the policy testing problem in discounted Markov decision processes (MDPs) by developing an algorithm to determine if a policy's value exceeds a threshold with minimal observations, achieving statistical optimality by matching an instance-specific lower bound on sample complexity.

We study the policy testing problem in discounted Markov decision processes (MDPs) under the fixed-confidence setting. The goal is to determine whether the value of a given policy exceeds a specified threshold while minimizing the number of observations. We begin by deriving an instance-specific lower bound that any algorithm must satisfy. This lower bound is characterized as the solution to an optimization problem with non-convex constraints. We propose a policy testing algorithm inspired by this optimization problem--a common approach in pure exploration problems such as best-arm identification, where asymptotically optimal algorithms often stem from such optimization-based characterizations. As for other pure exploration tasks in MDPs, however, the non-convex constraints in the lower-bound problem present significant challenges, raising doubts about whether statistically optimal and computationally tractable algorithms can be designed. To address this, we reformulate the lower-bound problem by interchanging the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. Strikingly, this reformulated problem admits an interpretation as a policy optimization task in a newly constructed reversed MDP. Leveraging recent advances in policy gradient methods, we efficiently solve this problem and use it to design a policy testing algorithm that is statistically optimal--matching the instance-specific lower bound on sample complexity--while remaining computationally tractable. We validate our approach with numerical experiments.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes