ML LG ST

May 21, 2025

Policy Testing in Markov Decision Processes

Kaito Ariu, Po-An Wang, Alexandre Proutiere, Kenshi Abe

arXiv:2505.15342v1

7.81 citations

h-index: 11

Originality: Highly original

AI Analysis

This work addresses a fundamental verification challenge in reinforcement learning for researchers and practitioners, though it is incremental as it builds on existing pure exploration frameworks in MDPs.

The paper tackles the policy testing problem in discounted Markov decision processes (MDPs) by developing an algorithm to determine if a policy's value exceeds a threshold with minimal observations, achieving statistical optimality by matching an instance-specific lower bound on sample complexity.

We study the policy testing problem in discounted Markov decision processes (MDPs) under the fixed-confidence setting. The goal is to determine whether the value of a given policy exceeds a specified threshold while minimizing the number of observations. We begin by deriving an instance-specific lower bound that any algorithm must satisfy. This lower bound is characterized as the solution to an optimization problem with non-convex constraints. We propose a policy testing algorithm inspired by this optimization problem--a common approach in pure exploration problems such as best-arm identification, where asymptotically optimal algorithms often stem from such optimization-based characterizations. As for other pure exploration tasks in MDPs, however, the non-convex constraints in the lower-bound problem present significant challenges, raising doubts about whether statistically optimal and computationally tractable algorithms can be designed. To address this, we reformulate the lower-bound problem by interchanging the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. Strikingly, this reformulated problem admits an interpretation as a policy optimization task in a newly constructed reversed MDP. Leveraging recent advances in policy gradient methods, we efficiently solve this problem and use it to design a policy testing algorithm that is statistically optimal--matching the instance-specific lower bound on sample complexity--while remaining computationally tractable. We validate our approach with numerical experiments.
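To make the question being tested concrete, here is a minimal sketch of the idealized case where the MDP is fully known, so no sampling is needed. It is not the algorithm proposed in the paper, which must decide from observations under a confidence constraint. The array layout, the function names `policy_value` and `exceeds_threshold`, and the use of an initial-state distribution `rho` to reduce the value function to a scalar are all assumptions made for illustration.

```python
import numpy as np

def policy_value(P, r, policy, gamma):
    """Exact value of a deterministic policy in a known discounted MDP.

    Assumed layout (illustrative): P has shape (S, A, S) with
    P[s, a, s'] the transition probability, r has shape (S, A),
    and policy is an integer array of shape (S,) giving one action per state.
    """
    n_states = P.shape[0]
    states = np.arange(n_states)
    P_pi = P[states, policy]   # (S, S) transition matrix under the policy
    r_pi = r[states, policy]   # (S,) reward vector under the policy
    # Solve the Bellman equation V = r_pi + gamma * P_pi @ V.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def exceeds_threshold(P, r, policy, gamma, rho, threshold):
    """Return True if the rho-weighted policy value is above the threshold.

    rho is a hypothetical initial-state distribution of shape (S,).
    """
    v = policy_value(P, r, policy, gamma)
    return bool(rho @ v > threshold)

# Toy two-state, two-action MDP with made-up numbers.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
policy = np.array([0, 1])
rho = np.array([0.5, 0.5])
print(exceeds_threshold(P, r, policy, gamma=0.9, rho=rho, threshold=10.0))
```

In the setting studied in the paper, `P` is not known and must be inferred from observed transitions, which is why the number of observations needed to answer the question reliably becomes the quantity of interest.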
