Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

arXiv:2605.1976838.5
AI Analysis

For researchers in reinforcement learning, this work provides the first minimax optimal regret characterization for MNL mixture MDPs, with variance-aware bounds that improve upon prior worst-case results.

This paper studies reinforcement learning for episodic MDPs with multinomial logistic transitions, proposing an algorithm that achieves regret \\(\ ilde{O}(dH^2\ar{\\sigma}_T\\sqrt{T})\\), where \\(\ar{\\sigma}_T\\) captures variance. This improves over the existing \\(\ ilde{O}(dH^2\\sqrt{T})\\) bound for structured MDPs, e.g., reducing horizon dependence by a factor \\(H\\) for KL-constrained robust MDPs. A matching lower bound is also established, proving minimax optimality.

We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\smash{\tilde{O}(dH^2\sqrt{T})}$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $\barσ\_T \leq 1/2$, measuring the normalised average variance of the optimal downstream value function along the learner's trajectory. We propose an algorithm achieving a regret of $\smash{\tilde{O}(dH^2\barσ\_T\sqrt{T})}$, which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $\barσ\_T = O(H^{-1})$, reducing the horizon dependence by a factor $H$. We further establish a matching $\smash{Ω(dH^2\barσ\_T\sqrt{T})}$ lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes