A Unified Algorithm for Stochastic Path Problems
This work addresses efficient reinforcement learning in stochastic path problems, contributing a regret analysis for the general setting and procedures for adapting to an unknown reward scale.
The paper provides the first regret guarantees for reinforcement learning in a general class of stochastic path problems, with bounds that match the best known results for the special case of stochastic shortest path (SSP) with non-positive rewards. It also introduces adaptation procedures for an unknown reward scale, showing that adaptation carries no penalty in SSP but an unavoidable penalty in stochastic longest path (SLP) problems, where all rewards are non-negative.
We study reinforcement learning in stochastic path (SP) problems. The goal in these problems is to maximize the expected sum of rewards until the agent reaches a terminal state. We provide the first regret guarantees in this general problem by analyzing a simple optimistic algorithm. Our regret bound matches the best known results for the well-studied special case of stochastic shortest path (SSP) with all non-positive rewards. For SSP, we present an adaptation procedure for the case when the scale of rewards $B_\star$ is unknown. We show that there is no price for adaptation, and our regret bound matches that with a known $B_\star$. We also provide a scale adaptation procedure for the special case of stochastic longest paths (SLP) where all rewards are non-negative. However, unlike in SSP, we show through a lower bound that there is an unavoidable price for adaptation.
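To make the objective concrete, the following is a minimal sketch of a stochastic path instance with known dynamics, solved by plain value iteration. It only illustrates the quantity being maximized (the expected sum of rewards until the terminal state is reached); it is not the optimistic learning algorithm analyzed in the paper. The two-state instance, the action names, and the `value_iteration` helper are all hypothetical.

```python
# Hypothetical stochastic path instance with non-positive rewards (SSP-style).
# State 2 is the terminal state; no reward is collected after reaching it.
TERMINAL = 2

# transitions[state][action] = (reward, [(probability, next_state), ...])
transitions = {
    0: {
        "safe": (-1.0, [(1.0, 1)]),
        "risky": (-1.0, [(0.5, TERMINAL), (0.5, 0)]),
    },
    1: {
        "cheap": (-2.0, [(1.0, TERMINAL)]),
        "costly": (-3.0, [(1.0, TERMINAL)]),
    },
}


def value_iteration(transitions, terminal, tol=1e-10, max_iters=10_000):
    """Approximate the optimal expected total reward until termination."""
    values = {s: 0.0 for s in transitions}
    values[terminal] = 0.0
    for _ in range(max_iters):
        new_values = {terminal: 0.0}
        for s, actions in transitions.items():
            # Bellman backup: best action by reward plus expected next value.
            new_values[s] = max(
                reward + sum(p * values[nxt] for p, nxt in outcomes)
                for reward, outcomes in actions.values()
            )
        delta = max(abs(new_values[s] - values[s]) for s in values)
        values = new_values
        if delta < tol:
            break
    return values


optimal_values = value_iteration(transitions, TERMINAL)
```

In this toy instance the best choice in state 0 is "risky": it terminates with probability 0.5 per step, so the expected total reward is -2, whereas "safe" followed by "cheap" yields -3. A learning agent does not know `transitions` in advance, and regret measures how much total reward it loses relative to these optimal values while learning.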