LGMLAug 11, 2022

Best Policy Identification in Linear MDPs

arXiv:2208.05633v14 citationsh-index: 48
Originality Incremental advance
AI Analysis

This work addresses sample-efficient policy identification for reinforcement learning in linear MDPs, offering near-optimal algorithms that are incremental improvements over prior methods.

The paper tackles the problem of identifying the best policy in linear Markov Decision Processes with a generative model, deriving an instance-specific lower bound and devising algorithms with sample complexity upper bounds that match existing minimax and gap-dependent lower bounds, such as O(d/(ε+Δ)^2 * (log(1/δ)+d)).

We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model. We first derive an instance-specific lower bound on the expected number of samples required to identify an $\varepsilon$-optimal policy with probability $1-δ$. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but can be used as the starting point to devise simple and near-optimal sampling rules and algorithms. We devise such algorithms. One of these exhibits a sample complexity upper bounded by ${\cal O}({\frac{d}{(\varepsilon+Δ)^2}} (\log(\frac{1}δ)+d))$ where $Δ$ denotes the minimum reward gap of sub-optimal actions and $d$ is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all $δ$), and matches existing minimax and gap-dependent lower bounds. We extend our algorithm to episodic linear MDPs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes