LG, AI | Sep 7, 2025

Teaching Precommitted Agents: Model-Free Policy Evaluation and Control in Quasi-Hyperbolic Discounted MDPs

arXiv:2509.06094v1 | 4.1 | h-index: 1 | CDC

Originality: Highly original

AI Analysis

The paper addresses a key theoretical and algorithmic gap for precommitted agents with QH preferences, providing foundational insights for incorporating human-like decision-making into RL.

This paper integrates Quasi-Hyperbolic (QH) discounting, a model of time-inconsistent preferences, into reinforcement learning. It proves that the optimal policy for a precommitted agent reduces to a simple one-step non-stationary form, and it designs the first practical, model-free algorithms for policy evaluation and Q-learning in this setting, both with provable convergence guarantees.

Time-inconsistent preferences, where agents favor smaller-sooner over larger-later rewards, are a key feature of human and animal decision-making. Quasi-Hyperbolic (QH) discounting provides a simple yet powerful model for this behavior, but its integration into the reinforcement learning (RL) framework has been limited. This paper addresses key theoretical and algorithmic gaps for precommitted agents with QH preferences. We make two primary contributions: (i) we formally characterize the structure of the optimal policy, proving for the first time that it reduces to a simple one-step non-stationary form; and (ii) we design the first practical, model-free algorithms for both policy evaluation and Q-learning in this setting, both with provable convergence guarantees. Our results provide foundational insights for incorporating QH preferences in RL.
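To make the smaller-sooner versus larger-later effect concrete, here is a minimal sketch of QH discounting. It assumes the standard beta-delta form, in which a reward at delay 0 has weight 1 and a reward at delay t >= 1 has weight beta * delta**t. The function names, parameter values, and reward sizes are illustrative assumptions and are not taken from the paper, whose notation may differ.

```python
def qh_weight(t, beta, delta):
    """Assumed beta-delta weight: 1 at t = 0, beta * delta**t for t >= 1."""
    return 1.0 if t == 0 else beta * delta ** t


def qh_return(rewards, beta, delta):
    """QH-discounted sum of a reward sequence, evaluated from time 0."""
    return sum(qh_weight(t, beta, delta) * r for t, r in enumerate(rewards))


def prefers_sooner(delay, beta=0.5, delta=0.95, small=10.0, large=15.0):
    """True if `small` at `delay` is valued above `large` at `delay + 1`.

    All default values are hypothetical, chosen only to show a reversal.
    """
    sooner = qh_weight(delay, beta, delta) * small
    later = qh_weight(delay + 1, beta, delta) * large
    return sooner > later


if __name__ == "__main__":
    # Same pair of rewards, viewed from two different distances in time.
    print(prefers_sooner(0))  # choice is immediate
    print(prefers_sooner(5))  # choice is five steps away
```

Under these assumed values the agent takes the smaller reward when it is available immediately, but prefers the larger one when both options lie several steps ahead. That reversal is the time inconsistency the abstract describes. A precommitted agent, as the term is generally used, fixes its plan using the time-0 weights and does not re-plan later.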
