A Subgame Perfect Equilibrium Reinforcement Learning Approach to Time-inconsistent Problems
This work addresses a theoretical bottleneck in reinforcement learning for time-inconsistent problems; the contribution is incremental, as it builds on existing dynamic programming and RL methods.
The paper tackles time-inconsistent problems in reinforcement learning by developing a subgame perfect equilibrium framework and backward policy iteration algorithms, and it evaluates convergence and model identifiability on a mean-variance portfolio selection problem.
In this paper, we establish a subgame perfect equilibrium reinforcement learning (SPERL) framework for time-inconsistent (TIC) problems. In the context of RL, TIC problems are known to face two main challenges: the non-existence of natural recursive relationships between value functions at different time points, and the violation of Bellman's principle of optimality, which calls into question the applicability of standard policy iteration algorithms because policy improvement theorems can no longer be proved. We adapt an extended dynamic programming theory and propose a new class of algorithms, called backward policy iteration (BPI), that solves SPERL and addresses both challenges. To demonstrate the practical usage of BPI as a training framework, we adapt standard RL simulation methods and derive two BPI-based training algorithms. We examine our derived training frameworks on a mean-variance portfolio selection problem and evaluate performance metrics, including convergence and model identifiability.
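As a minimal illustration of the equilibrium concept, and not the BPI algorithms themselves, the sketch below computes a subgame perfect equilibrium policy for a toy mean-variance problem by model-based backward induction. Because the variance term admits no Bellman-style recursion on the value function, the recursion is carried on the first two moments of terminal wealth, in the spirit of extended dynamic programming. Everything specific here is a hypothetical assumption made for illustration: the additive wealth dynamics, the known two-outcome payoff distribution, the horizon `T`, the risk aversion `GAMMA`, the action set, and the function name `moments`.

```python
from functools import lru_cache

# All constants below are hypothetical, chosen only for illustration.
T = 3                              # number of decision times
GAMMA = 0.2                        # risk-aversion weight on the variance
ACTIONS = (0, 1, 2)                # units held in the risky asset
OUTCOMES = ((2, 0.5), (-1, 0.5))   # (payoff per unit held, probability)


@lru_cache(maxsize=None)
def moments(t, x):
    """Return (E[W_T], E[W_T^2], equilibrium action) from wealth x at time t.

    The later selves are assumed to follow the equilibrium policy already
    determined for times t+1, ..., T-1; the time-t self then maximizes its
    own objective E[W_T] - (GAMMA / 2) * Var[W_T].
    """
    if t == T:
        return float(x), float(x * x), None

    best = None
    for a in ACTIONS:
        # Propagate the two moments one step back (tower property holds
        # for each moment separately, though not for the variance).
        m1 = sum(p * moments(t + 1, x + a * z)[0] for z, p in OUTCOMES)
        m2 = sum(p * moments(t + 1, x + a * z)[1] for z, p in OUTCOMES)
        objective = m1 - 0.5 * GAMMA * (m2 - m1 * m1)
        if best is None or objective > best[0]:
            best = (objective, m1, m2, a)

    _, m1, m2, a = best
    return m1, m2, a


if __name__ == "__main__":
    mean, second_moment, action = moments(0, 0)
    print("equilibrium action at (t=0, x=0):", action)
    print("E[W_T] =", mean, " Var[W_T] =", second_moment - mean * mean)
```

The design choice worth noting is that each time-t self optimizes only its own action while taking the future selves' policy as fixed, which is what distinguishes an equilibrium policy from a precommitted optimum. A learning-based variant would replace the exact expectations over `OUTCOMES` with estimates from simulated transitions.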