Foundations of Safe Online Reinforcement Learning in the Linear Quadratic Regulator: Generalized Baselines
This work provides theoretical foundations for safe online reinforcement learning, which is crucial for real-world applications where safety constraints are paramount. It is an incremental step in understanding the interplay between safety and exploration.
This paper investigates safe online reinforcement learning in a Linear Quadratic Regulator setting where the position must remain within a safe region with high probability. The authors demonstrate that for any nonlinear baseline controller satisfying natural assumptions, $\tilde{O}_T(\sqrt{T})$-regret is achievable when the noise distribution has sufficiently large support, and $\tilde{O}_T(T^{2/3})$-regret is achievable for any subgaussian noise distribution.
Many practical applications of online reinforcement learning require the satisfaction of safety constraints while learning about the unknown environment. In this work, we establish theoretical foundations for reinforcement learning with safety constraints by studying the canonical problem of Linear Quadratic Regulator learning with unknown dynamics, but with the additional constraint that the position must stay within a safe region for the entire trajectory with high probability. Our primary contribution is a general framework for studying stronger baselines of nonlinear controllers that are better suited for constrained problems than linear controllers. Due to the difficulty of analyzing nonlinear controllers in a constrained problem, we focus on 1-dimensional state and action spaces; however, we also discuss how we expect the high-level takeaways to generalize to higher dimensions. Using our framework, we show that for \emph{any} nonlinear baseline satisfying natural assumptions, $\tilde{O}_T(\sqrt{T})$-regret is possible when the noise distribution has sufficiently large support, and $\tilde{O}_T(T^{2/3})$-regret is possible for \emph{any} subgaussian noise distribution. In proving these results, we introduce a new uncertainty estimation bound for nonlinear controls which shows that enforcing safety in the presence of sufficient noise can provide ``free exploration'' that compensates for the added cost of uncertainty in safety-constrained control.
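To make the setting concrete, the following is a minimal sketch of a 1-dimensional constrained LQR rollout with one possible nonlinear controller, a truncated linear feedback law. It is not the paper's algorithm: the dynamics parameters are assumed known (so no learning occurs), the noise is assumed uniform on a bounded interval, the safe region is assumed to be the symmetric interval $[-D, D]$, and all names (`truncated_linear_control`, `simulate`, `d_safe`, `w_max`) are hypothetical and chosen only for illustration.

```python
import random


def truncated_linear_control(x, a, b, k, d_safe, w_max):
    """Linear feedback u = -k*x, clipped so the noiseless next state
    a*x + b*u lies in [-(d_safe - w_max), d_safe - w_max].

    Assumption: noise is bounded by w_max, so this clipping keeps the
    next state inside [-d_safe, d_safe] for every noise realization.
    """
    margin = d_safe - w_max
    if margin < 0:
        raise ValueError("noise bound exceeds the safe region half-width")
    if b == 0:
        raise ValueError("b must be nonzero for the control to act")
    u = -k * x
    # Solve -margin <= a*x + b*u <= margin for u.
    lo = (-margin - a * x) / b
    hi = (margin - a * x) / b
    if lo > hi:  # happens when b is negative
        lo, hi = hi, lo
    return min(max(u, lo), hi)


def simulate(a, b, k, d_safe, w_max, horizon, q=1.0, r=1.0, seed=0):
    """Roll out x_{t+1} = a*x_t + b*u_t + w_t with quadratic cost
    q*x^2 + r*u^2 and return (total cost, largest |x| reached)."""
    rng = random.Random(seed)
    x = 0.0
    total_cost = 0.0
    max_abs_x = 0.0
    for _ in range(horizon):
        u = truncated_linear_control(x, a, b, k, d_safe, w_max)
        total_cost += q * x * x + r * u * u
        w = rng.uniform(-w_max, w_max)  # assumed bounded noise
        x = a * x + b * u + w
        max_abs_x = max(max_abs_x, abs(x))
    return total_cost, max_abs_x


if __name__ == "__main__":
    cost, worst = simulate(a=1.2, b=1.0, k=0.5, d_safe=1.0,
                           w_max=0.6, horizon=1000)
    print(f"total cost: {cost:.2f}, max |x|: {worst:.3f}")
```

In this sketch the clipping step is what makes the controller nonlinear: near the boundary of the safe region the control deviates from the linear law $u = -kx$ to keep the position safe. When the dynamics are unknown, as in the abstract's setting, the clipping bounds would have to be computed from uncertain estimates of `a` and `b`, which is where the cost of uncertainty enters.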