LGMLMay 21, 2018

Online Learning in Kernelized Markov Decision Processes

arXiv:1805.08052v219 citations
AI Analysis

This work addresses online learning in complex, continuous MDPs for reinforcement learning applications, representing an incremental extension of kernelized bandits to MDPs.

The authors tackled the problem of online learning in unknown, episodic Markov decision processes with continuous states and actions by developing UCRL and posterior sampling variants using nonparametric Gaussian process priors, achieving sublinear regret bounds in terms of kernel structural parameters.

We consider online learning for minimizing regret in unknown, episodic Markov decision processes (MDPs) with continuous states and actions. We develop variants of the UCRL and posterior sampling algorithms that employ nonparametric Gaussian process priors to generalize across the state and action spaces. When the transition and reward functions of the true MDP are members of the associated Reproducing Kernel Hilbert Spaces of functions induced by symmetric psd kernels (frequentist setting), we show that the algorithms enjoy sublinear regret bounds. The bounds are in terms of explicit structural parameters of the kernels, namely a novel generalization of the information gain metric from kernelized bandit, and highlight the influence of transition and reward function structure on the learning performance. Our results are applicable to multidimensional state and action spaces with composite kernel structures, and generalize results from the literature on kernelized bandits, and the adaptive control of parametric linear dynamical systems with quadratic costs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes