LG AI Sep 16, 2024

Reinforcement Learning with Quasi-Hyperbolic Discounting

S. R. Eshwar, Mayank Motwani, Nibedita Roy, Gugan Thoppe

arXiv:2409.10583v14.62 citations h-index: 9

Originality: Highly original

AI Analysis

This work addresses the challenge of sub-optimal returns in reinforcement learning due to time-inconsistent preferences, advancing practical applications in domains like inventory management.

The paper tackles reinforcement learning with quasi-hyperbolic discounting, which models the human bias towards immediate gratification. It proposes the first model-free algorithm for finding a Markov Perfect Equilibrium and validates it numerically on a standard inventory system with stochastic demands.

Reinforcement learning has traditionally been studied with exponential discounting or the average reward setup, mainly due to their mathematical tractability. However, such frameworks fall short of accurately capturing human behavior, which has a bias towards immediate gratification. Quasi-Hyperbolic (QH) discounting is a simple alternative for modeling this bias. Unlike in traditional discounting, though, the optimal QH-policy starting from some time $t_1$ can differ from the one starting from $t_2$. Hence, the future self of an agent, if it is naive or impatient, can deviate from the policy that is optimal at the start, leading to sub-optimal overall returns. To prevent this behavior, an alternative is to work with a policy anchored in a Markov Perfect Equilibrium (MPE). In this work, we propose the first model-free algorithm for finding an MPE. Using a two-timescale analysis, we show that, if our algorithm converges, then the limit must be an MPE. We also validate this claim numerically for the standard inventory system with stochastic demands. Our work significantly advances the practical application of reinforcement learning.
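The time inconsistency described in the abstract can be illustrated with a small sketch. It assumes the common two-parameter form of QH discounting, in which a reward $k$ steps ahead is weighted by $1$ if $k = 0$ and by $\beta \delta^k$ otherwise. The symbols `beta` and `delta`, the helper names, and the reward values are illustrative assumptions, not notation or data from the paper, and this is not the paper's algorithm.

```python
def qh_weight(k, beta=0.5, delta=0.9):
    """QH weight on a reward k steps ahead (assumed beta-delta form)."""
    return 1.0 if k == 0 else beta * delta ** k


def prefers_later(now, t_small, r_small, t_large, r_large, beta=0.5, delta=0.9):
    """True if, judged at time `now`, the later, larger reward is valued more."""
    v_small = qh_weight(t_small - now, beta, delta) * r_small
    v_large = qh_weight(t_large - now, beta, delta) * r_large
    return v_large > v_small


# Hypothetical choice: reward 10 at time 5, or reward 12 at time 6.
plan_at_0 = prefers_later(now=0, t_small=5, r_small=10.0, t_large=6, r_large=12.0)
choice_at_5 = prefers_later(now=5, t_small=5, r_small=10.0, t_large=6, r_large=12.0)

# Setting beta = 1 recovers exponential discounting, which ranks the two
# options the same way at both decision times.
exp_at_0 = prefers_later(0, 5, 10.0, 6, 12.0, beta=1.0)
exp_at_5 = prefers_later(5, 5, 10.0, 6, 12.0, beta=1.0)

print(plan_at_0, choice_at_5)  # QH agent: plan made at t=0 vs. choice made at t=5
print(exp_at_0, exp_at_5)      # exponential agent at the same two times
```

Under these assumed values, the agent at time 0 plans to wait for the larger reward, but on reaching time 5 it takes the immediate one. This is the kind of deviation by a future self that motivates looking for a policy that every future self is also willing to follow.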
