ML, IT, LG · Jul 13, 2021

Model Selection for Generic Reinforcement Learning

arXiv:2107.05849v2 · 2 citations
Originality: Incremental advance
AI Analysis

The paper addresses a practical problem in RL, selecting an appropriate model complexity. The advance is incremental: it builds on existing UCRL methods and adds a novel adaptation module.

The paper tackles model selection in finite-horizon episodic reinforcement learning by proposing ARL-GEN, an algorithm that adapts to the smallest model family containing the true transition kernel. It achieves a regret of $\tilde{\mathcal{O}}(d_{\mathcal{E}}^* H^2 + \sqrt{d_{\mathcal{E}}^* \mathbb{M}^* H^2 T})$, matching an oracle that knows the model class in advance; the cost of model selection is an additive term that depends only weakly on $T$.

We address the problem of model selection for the finite-horizon episodic Reinforcement Learning (RL) problem where the transition kernel $P^*$ belongs to a family of models $\mathcal{P}^*$ with finite metric entropy. In the model selection framework, instead of $\mathcal{P}^*$, we are given $M$ nested families of transition kernels $\mathcal{P}_1 \subset \mathcal{P}_2 \subset \ldots \subset \mathcal{P}_M$. We propose and analyze a novel algorithm, namely Adaptive Reinforcement Learning (General) (ARL-GEN), that adapts to the smallest such family where the true transition kernel $P^*$ lies. ARL-GEN uses the Upper Confidence Reinforcement Learning (UCRL) algorithm with value-targeted regression as a blackbox and puts a model selection module at the beginning of each epoch. Under a mild separability assumption on the model classes, we show that ARL-GEN obtains a regret of $\tilde{\mathcal{O}}(d_{\mathcal{E}}^* H^2 + \sqrt{d_{\mathcal{E}}^* \mathbb{M}^* H^2 T})$, with high probability, where $H$ is the horizon length, $T$ is the total number of steps, $d_{\mathcal{E}}^*$ is the Eluder dimension and $\mathbb{M}^*$ is the metric entropy corresponding to $\mathcal{P}^*$. Note that this regret scaling matches that of an oracle that knows $\mathcal{P}^*$ in advance. We show that the cost of model selection for ARL-GEN is an additive term in the regret having a weak dependence on $T$. Subsequently, we remove the separability assumption and consider the setup of linear mixture MDPs, where the transition kernel $P^*$ has a linear function approximation. With this low-rank structure, we propose novel adaptive algorithms for model selection, and obtain (order-wise) regret identical to that of an oracle with knowledge of the true model class.
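The idea of adapting to the smallest nested family that contains the true model can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's procedure: the "model" is a single Bernoulli probability, each class is a finite grid of candidate probabilities, the statistic is the best-fit mean squared error, and the threshold is hand-picked. The names `best_fit_loss`, `select_smallest_class`, and `NESTED_CLASSES` are hypothetical.

```python
from typing import Sequence


def best_fit_loss(candidates: Sequence[float], outcomes: Sequence[int]) -> float:
    """Smallest mean squared error achieved by any candidate probability."""
    return min(
        sum((x - p) ** 2 for x in outcomes) / len(outcomes) for p in candidates
    )


def select_smallest_class(
    nested_classes: Sequence[Sequence[float]],
    outcomes: Sequence[int],
    threshold: float,
) -> int:
    """Index of the smallest class that fits almost as well as the largest.

    Assumption: classes are nested, so the largest class attains the lowest
    best-fit loss and serves as the reference for the excess loss.
    """
    losses = [best_fit_loss(c, outcomes) for c in nested_classes]
    reference = losses[-1]
    for m, loss in enumerate(losses):
        if loss - reference <= threshold:
            return m
    # Only reached when threshold is negative; fall back to the largest class.
    return len(nested_classes) - 1


# Toy nested families: each grid contains the previous one.
NESTED_CLASSES = [
    [0.5],
    [0.3, 0.5, 0.7],
    [0.1, 0.3, 0.5, 0.7, 0.9],
]

# Seven successes out of ten: the middle class already contains p = 0.7.
outcomes = [1] * 7 + [0] * 3
chosen = select_smallest_class(NESTED_CLASSES, outcomes, threshold=0.01)
print(chosen)
```

According to the abstract, ARL-GEN runs its model selection module at the beginning of each epoch and then hands the chosen class to UCRL; in this toy version that would correspond to calling `select_smallest_class` on the data collected so far before each epoch.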
