LGJul 22, 2022

Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP

arXiv:2207.11126v213 citationsh-index: 80
Originality Incremental advance
AI Analysis

This work addresses the problem of efficient learning in contextual MDPs for reinforcement learning researchers, offering the first optimistic approach with general function approximation, though it is incremental in extending existing methods to contextual settings.

The paper tackles regret minimization in stochastic contextual Markov Decision Processes (MDPs) under a minimum reachability assumption, using an offline least square regression oracle, and achieves a regret bound of $\widetilde{O}( (H+{1}/{p_{min}})H|S|^{3/2}\sqrt{|A|T\log(\max\{|\mathcal{G}|,|\mathcal{P}|\}/δ)})$ for the most challenging setting with unknown, context-dependent dynamics, while also providing a lower bound and an extension to $\widetilde{O}(T^{3/4})$ regret without the assumption.

We present regret minimization algorithms for stochastic contextual MDPs under minimum reachability assumption, using an access to an offline least square regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context and the most challenging setting where the dynamics is unknown and context-dependent. For the latter, our algorithm obtains regret bound of $\widetilde{O}( (H+{1}/{p_{min}})H|S|^{3/2}\sqrt{|A|T\log(\max\{|\mathcal{G}|,|\mathcal{P}|\}/δ)})$ with probability $1-δ$, where $\mathcal{P}$ and $\mathcal{G}$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear and etc.). We present a lower bound of $Ω(\sqrt{T H |S| |A| \ln(|\mathcal{G}|)/\ln(|A|)})$, on the expected regret which holds even in the case of known dynamics. Lastly, we discuss an extension of our results to CMDPs without minimum reachability, that obtains $\widetilde{O}(T^{3/4})$ regret.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes