Howard's Policy Iteration is Subexponential for Deterministic Markov Decision Problems with Rewards of Fixed Bit-size and Arbitrary Discount Factor
This provides a theoretical breakthrough for algorithm analysis in operations research and AI, addressing a fundamental gap in understanding HPI's efficiency for deterministic MDPs.
The paper tackles the long-standing exponential upper bound on Howard's Policy Iteration (HPI) for deterministic Markov Decision Problems (DMDPs), achieving a subexponential upper bound parameterized by reward bit-size and independent of discount factor, including cases with two arbitrary rewards.
Howard's Policy Iteration (HPI) is a classic algorithm for solving Markov Decision Problems (MDPs). HPI uses a "greedy" switching rule to update from any non-optimal policy to a dominating one, iterating until an optimal policy is found. Despite its introduction over 60 years ago, the best-known upper bounds on HPI's running time remain exponential in the number of states -- indeed even on the restricted class of MDPs with only deterministic transitions (DMDPs). Meanwhile, the tightest lower bound for HPI for MDPs with a constant number of actions per state is only linear. In this paper, we report a significant improvement: a subexponential upper bound for HPI on DMDPs, which is parameterised by the bit-size of the rewards, while independent of the discount factor. The same upper bound also applies to DMDPs with only two possible rewards (which may be of arbitrary size).