LG · OC · PR · ML · Dec 19, 2021

Exploration-exploitation trade-off for continuous-time episodic reinforcement learning with linear-convex models

arXiv:2112.10264v1 · 31 citations
Originality Highly original
AI Analysis

This work addresses the exploration-exploitation trade-off in continuous-time episodic reinforcement learning and is aimed at researchers in reinforcement learning and control theory. It offers incremental improvements by extending results from linear-quadratic to linear-convex models and by giving explicit regret bounds.

The paper studies model-based reinforcement learning for continuous-time episodic control with unknown linear dynamics and convex objectives. It establishes conditions under which the performance gap between controls derived from the estimated and the true model is quadratic, and it proposes a phase-based algorithm with sublinear regret over N episodes: O(√N ln N) with high probability in the general case, and O((ln N)^2) in expectation in the self-exploration case.

We develop a probabilistic framework for analysing model-based reinforcement learning in the episodic setting. We then apply it to study finite-time horizon stochastic control problems with linear dynamics but unknown coefficients and convex, but possibly irregular, objective functions. Using probabilistic representations, we study regularity of the associated cost functions and establish precise estimates for the performance gap between applying optimal feedback control derived from estimated and true model parameters. We identify conditions under which this performance gap is quadratic, improving the linear performance gap in recent work [X. Guo, A. Hu, and Y. Zhang, arXiv preprint, arXiv:2104.09311, (2021)] and matching the results obtained for stochastic linear-quadratic problems. Next, we propose a phase-based learning algorithm for which we show how to optimise the exploration-exploitation trade-off and achieve sublinear regret in high probability and in expectation. When the assumptions needed for the quadratic performance gap hold, the algorithm achieves, over $N$ episodes, a high-probability regret of order $\mathcal{O}(\sqrt{N} \ln N)$ in the general case and an expected regret of order $\mathcal{O}((\ln N)^2)$ in the self-exploration case, matching the best possible results from the literature. The analysis requires novel concentration inequalities for correlated continuous-time observations, which we derive.
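
To make "sublinear regret" concrete, the following minimal Python sketch evaluates the two growth rates quoted in the abstract, $\sqrt{N} \ln N$ and $(\ln N)^2$, and divides each by $N$ to show that the average regret per episode shrinks as $N$ grows. It illustrates the rates only, not the paper's algorithm: the function names, the chosen episode counts, and the omission of all constants are assumptions made for this example.

```python
import math


def high_prob_rate(n: int) -> float:
    """Growth rate sqrt(N) * ln N of the general-case bound (constants omitted)."""
    return math.sqrt(n) * math.log(n)


def expected_rate(n: int) -> float:
    """Growth rate (ln N)^2 of the self-exploration bound (constants omitted)."""
    return math.log(n) ** 2


def per_episode(rate, n: int) -> float:
    """Average per-episode value rate(N) / N; it tends to 0 for a sublinear rate."""
    return rate(n) / n


# Hypothetical episode counts, chosen only for illustration.
episodes = [10, 100, 1_000, 10_000, 100_000]

rows = [
    (n, per_episode(high_prob_rate, n), per_episode(expected_rate, n))
    for n in episodes
]

for n, hp, ex in rows:
    print(f"N={n:>7}  sqrt(N) ln N / N = {hp:.5f}   (ln N)^2 / N = {ex:.7f}")
```

Both per-episode columns decrease towards zero, which is what sublinearity means here, and the $(\ln N)^2$ column decreases much faster, reflecting the stronger bound in the self-exploration case.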

