LG OCJun 19, 2024

Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation

arXiv:2406.13633v33 citations

AI Analysis

This work addresses efficient learning in model-based RL with non-linear approximations, providing theoretical guarantees for infinite-horizon problems, though it is incremental in extending existing methods to specific function classes.

The paper tackles reinforcement learning with multinomial logistic function approximation, developing an algorithm with provable regret bounds of $ ilde{\mathcal{O}}(dD\sqrt{T})$ for average-reward and $ ilde{\mathcal{O}}(d(1-\gamma)^{-2}\sqrt{T})$ for discounted-reward settings, and complements these with matching or improved lower bounds.

We study model-based reinforcement learning with non-linear function approximation where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logistic (MNL) model. We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings. For average-reward communicating MDPs, the algorithm guarantees a regret upper bound of $\tilde{\mathcal{O}}(dD\sqrt{T})$ where $d$ is the dimension of feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. For discounted-reward MDPs, our algorithm achieves $\tilde{\mathcal{O}}(d(1-γ)^{-2}\sqrt{T})$ regret where $γ$ is the discount factor. Then we complement these upper bounds by providing several regret lower bounds. We prove a lower bound of $Ω(d\sqrt{DT})$ for learning communicating MDPs of diameter $D$ and a lower bound of $Ω(d(1-γ)^{3/2}\sqrt{T})$ for learning discounted-reward MDPs with discount factor $γ$. Lastly, we show a regret lower bound of $Ω(dH^{3/2}\sqrt{K})$ for learning $H$-horizon episodic MDPs with MNL function approximation where $K$ is the number of episodes, which improves upon the best-known lower bound for the finite-horizon setting.

View on arXiv PDF

Similar