Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

Pierre Boudart, Pierre Gaillard, Alessandro Rudi

arXiv:2605.1976838.5

AI Analysis

For researchers in reinforcement learning, this work provides the first minimax optimal regret characterization for MNL mixture MDPs, with variance-aware bounds that improve upon prior worst-case results.

This paper studies reinforcement learning for episodic MDPs with multinomial logistic transitions, proposing an algorithm that achieves regret \$\ ilde{O}(dH^2\ar{\\sigma}_T\\sqrt{T})\$, where \$\ar{\\sigma}_T\$ captures variance. This improves over the existing \$\ ilde{O}(dH^2\\sqrt{T})\$ bound for structured MDPs, e.g., reducing horizon dependence by a factor \$H\$ for KL-constrained robust MDPs. A matching lower bound is also established, proving minimax optimality.

We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\smash{\tilde{O}(dH^2\sqrt{T})}$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $\barσ\_T \leq 1/2$, measuring the normalised average variance of the optimal downstream value function along the learner's trajectory. We propose an algorithm achieving a regret of $\smash{\tilde{O}(dH^2\barσ\_T\sqrt{T})}$, which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $\barσ\_T = O(H^{-1})$, reducing the horizon dependence by a factor $H$. We further establish a matching $\smash{Ω(dH^2\barσ\_T\sqrt{T})}$ lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.

View on arXiv PDF

Similar