LGFeb 2

Learning Markov Decision Processes under Fully Bandit Feedback

Zhengjia Zhuo, Anupam Gupta, Viswanath Nagarajan

arXiv:2602.02260v11.4h-index: 1

Originality Highly original

AI Analysis

This work addresses the problem of learning in episodic MDPs with highly restricted feedback for researchers and practitioners in reinforcement learning, providing an incremental step towards more realistic feedback settings.

The authors tackled the problem of learning Markov Decision Processes under fully bandit feedback, achieving a regret bound of $widetilde{O}(sqrt{T})$. Their algorithm's performance is comparable to state-of-art learning algorithms with detailed state-action feedback, despite the highly restricted feedback.

A standard assumption in Reinforcement Learning is that the agent observes every visited state-action pair in the associated Markov Decision Process (MDP), along with the per-step rewards. Strong theoretical results are known in this setting, achieving nearly-tight $Θ(\sqrt{T})$-regret bounds. However, such detailed feedback can be unrealistic, and recent research has investigated more restricted settings such as trajectory feedback, where the agent observes all the visited state-action pairs, but only a single \emph{aggregate} reward. In this paper, we consider a far more restrictive ``fully bandit'' feedback model for episodic MDPs, where the agent does not even observe the visited state-action pairs -- it only learns the aggregate reward. We provide the first efficient bandit learning algorithm for episodic MDPs with $\widetilde{O}(\sqrt{T})$ regret. Our regret has an exponential dependence on the horizon length $\H$, which we show is necessary. We also obtain improved nearly-tight regret bounds for ``ordered'' MDPs; these can be used to model classical stochastic optimization problems such as $k$-item prophet inequality and sequential posted pricing. Finally, we evaluate the empirical performance of our algorithm for the setting of $k$-item prophet inequalities; despite the highly restricted feedback, our algorithm's performance is comparable to that of a state-of-art learning algorithm (UCB-VI) with detailed state-action feedback.

View on arXiv PDF

Similar