LGMay 21, 2025

Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning

arXiv:2505.15782v11 citationsh-index: 1
Originality Highly original
AI Analysis

This addresses a foundational challenge in reinforcement learning for scenarios requiring single-trial evaluation, such as high-stakes or safety-critical applications, representing a novel method rather than an incremental improvement.

The paper tackles the problem of solving infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, where performance is evaluated on a single trajectory, by introducing the first approach that uses online planning techniques like Monte-Carlo tree search, and demonstrates superior performance compared to baselines in experiments.

In this work, we contribute the first approach to solve infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when the agent's performance is evaluated based on a single trajectory. First, we provide some fundamental results regarding policy optimization in the single-trial regime, investigating which class of policies suffices for optimality, casting our problem as a particular MDP that is equivalent to our original problem, as well as studying the computational hardness of policy optimization in the single-trial regime. Second, we show how we can leverage online planning techniques, in particular a Monte-Carlo tree search algorithm, to solve GUMDPs in the single-trial regime. Third, we provide experimental results showcasing the superior performance of our approach in comparison to relevant baselines.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes