Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs
For researchers in reinforcement learning, this work provides the first minimax optimal regret characterization for MNL mixture MDPs, with variance-aware bounds that improve upon prior worst-case results.
This paper studies reinforcement learning for episodic MDPs with multinomial logistic transitions, proposing an algorithm that achieves regret \\(\ ilde{O}(dH^2\ar{\\sigma}_T\\sqrt{T})\\), where \\(\ar{\\sigma}_T\\) captures variance. This improves over the existing \\(\ ilde{O}(dH^2\\sqrt{T})\\) bound for structured MDPs, e.g., reducing horizon dependence by a factor \\(H\\) for KL-constrained robust MDPs. A matching lower bound is also established, proving minimax optimality.
We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\smash{\tilde{O}(dH^2\sqrt{T})}$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $\barσ\_T \leq 1/2$, measuring the normalised average variance of the optimal downstream value function along the learner's trajectory. We propose an algorithm achieving a regret of $\smash{\tilde{O}(dH^2\barσ\_T\sqrt{T})}$, which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $\barσ\_T = O(H^{-1})$, reducing the horizon dependence by a factor $H$. We further establish a matching $\smash{Ω(dH^2\barσ\_T\sqrt{T})}$ lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.