Reductive MDPs: A Perspective Beyond Temporal Horizons
This provides a new perspective on MDP complexity for researchers in reinforcement learning and operations research, though it appears incremental as it extends existing methods like backwards induction.
The paper tackles the computational disparity between solving general Markov decision processes (MDPs) and finite-horizon MDPs by identifying a sub-class of stochastic shortest path problems (SSPs) called reductive SSPs, showing that optimal policies can be recovered in polynomial-time for these problems.
Solving general Markov decision processes (MDPs) is a computationally hard problem. Solving finite-horizon MDPs, on the other hand, is highly tractable with well known polynomial-time algorithms. What drives this extreme disparity, and do problems exist that lie between these diametrically opposed complexities? In this paper we identify and analyse a sub-class of stochastic shortest path problems (SSPs) for general state-action spaces whose dynamics satisfy a particular drift condition. This construction generalises the traditional, temporal notion of a horizon via decreasing reachability: a property called reductivity. It is shown that optimal policies can be recovered in polynomial-time for reductive SSPs -- via an extension of backwards induction -- with an efficient analogue in reductive MDPs. The practical considerations of the proposed approach are discussed, and numerical verification provided on a canonical optimal liquidation problem.