LG, SY · Sep 18, 2023

Actively Learning Reinforcement Learning: A Stochastic Optimal Control Approach

arXiv:2309.10831v4
Originality: Highly original
AI Analysis

This work addresses inefficient exploration and high computational cost in reinforcement learning and control, a problem relevant to researchers and practitioners alike. It represents a novel integration of reinforcement learning and stochastic optimal control rather than an incremental improvement.

The paper tackles two intertwined challenges: equipping reinforcement learning with active exploration to manage state and parameter uncertainties, and overcoming the computational intractability of stochastic optimal control. It addresses both by using reinforcement learning to compute the stochastic optimal control law. The resulting controller automatically balances caution and probing in real time. In a numerical simulation, a Linear Quadratic Regulator (LQR) under the certainty equivalence assumption can perform poorly and lead to filter divergence, whereas the proposed approach remains stabilizing with acceptable performance.
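The intractability mentioned above can be seen from the generic stochastic dynamic programming recursion. What follows is a standard finite-horizon form under assumed notation, not taken from the paper: state x_k (with any unknown parameters appended as constant states), measurements y_k, control u_k, stage cost ℓ, terminal cost ℓ_N and horizon N. The paper's own formulation may differ.

$$
\mathcal{I}_k = \{y_0, \dots, y_k,\; u_0, \dots, u_{k-1}\}
$$
$$
V_N(\mathcal{I}_N) = \mathbb{E}\left[\ell_N(x_N) \mid \mathcal{I}_N\right], \qquad
V_k(\mathcal{I}_k) = \min_{u_k} \mathbb{E}\left[\ell(x_k, u_k) + V_{k+1}(\mathcal{I}_{k+1}) \mid \mathcal{I}_k, u_k\right], \quad k = N-1, \dots, 0.
$$

Under standard assumptions V_k depends on I_k only through the conditional density p(x_k | I_k), so the recursion runs over a space of probability densities; discretizing that space is what the curse of dimensionality rules out. Because u_k enters the information I_{k+1} on which future decisions are conditioned, the minimizing control accounts for how it shapes future information (probing), while the expectation over the uncertain state penalizes aggressive action (caution).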

In this paper we propose a framework for achieving two intertwined objectives: (i) equipping reinforcement learning with active exploration and deliberate information gathering, such that it regulates state and parameter uncertainties resulting from modeling mismatches and noisy sensing; and (ii) overcoming the computational intractability of stochastic optimal control. We approach both objectives by using reinforcement learning to compute the stochastic optimal control law. On the one hand, we avoid the curse of dimensionality that prohibits a direct solution of the stochastic dynamic programming equation. On the other hand, the resulting stochastic optimal control reinforcement learning agent admits caution and probing, that is, optimal online exploration and exploitation. Unlike a fixed balance between exploration and exploitation, caution and probing are employed automatically by the controller in real time, even after the learning process is terminated. We conclude the paper with a numerical simulation illustrating how a Linear Quadratic Regulator with the certainty equivalence assumption may lead to poor performance and filter divergence, while our proposed approach is stabilizing, achieves acceptable performance, and is computationally convenient.
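To make "caution" and "certainty equivalence" concrete, here is a classic scalar textbook example rather than the paper's algorithm. Assume x_{k+1} = a x_k + b u_k + w_k with a known, the input gain b unknown, w_k ~ N(0, σ²), and the state measured exactly. A Gaussian belief b ~ N(b̂, p) is updated by a scalar Kalman filter. Minimizing the one-step cost E[x_{k+1}²] = (a x_k + b̂ u_k)² + p u_k² + σ² gives the cautious law u_k = -a b̂ x_k / (b̂² + p). Certainty equivalence drops the p u_k² term and gives u_k = -a x_k / b̂. Both are myopic one-step laws, simpler than the LQR baseline in the paper, and neither probes deliberately. Probing comes from solving the full stochastic dynamic program, which the paper approximates with reinforcement learning. All names (GainBelief, ce_control, cautious_control, expected_next_sq, update_belief, simulate) and parameter values are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical scalar illustration of certainty equivalence vs. cautious control.
# This is NOT the paper's algorithm; names and parameter values are made up.


@dataclass(frozen=True)
class GainBelief:
    """Gaussian belief b ~ N(mean, var) over the unknown input gain b."""
    mean: float
    var: float


def ce_control(a: float, x: float, belief: GainBelief) -> float:
    """Certainty equivalence: treat the estimate as the true gain.
    Raises ZeroDivisionError if belief.mean == 0."""
    return -a * x / belief.mean


def cautious_control(a: float, x: float, belief: GainBelief) -> float:
    """Minimizes E[x_{k+1}^2] under the belief; shrinks |u| as var grows."""
    return -a * belief.mean * x / (belief.mean ** 2 + belief.var)


def expected_next_sq(a: float, x: float, u: float, belief: GainBelief,
                     noise_var: float) -> float:
    """E[(a*x + b*u + w)^2] with b ~ N(mean, var) and w ~ N(0, noise_var)."""
    return (a * x + belief.mean * u) ** 2 + belief.var * u ** 2 + noise_var


def update_belief(belief: GainBelief, u: float, z: float,
                  noise_var: float) -> GainBelief:
    """Scalar Kalman update for the measurement z = b*u + w of a constant gain b."""
    s = u * u * belief.var + noise_var        # innovation variance
    k = belief.var * u / s                    # Kalman gain
    mean = belief.mean + k * (z - belief.mean * u)
    var = belief.var * noise_var / s          # same as (1 - k*u) * var
    return GainBelief(mean, var)


Policy = Callable[[float, float, GainBelief], float]


def simulate(policy: Policy, a: float = 1.2, b_true: float = 0.5,
             prior: GainBelief = GainBelief(0.05, 1.0), noise_var: float = 0.01,
             steps: int = 30, seed: int = 0) -> Tuple[float, GainBelief]:
    """Run the closed loop from x_0 = 1; return (sum of x_k^2, final belief)."""
    rng = random.Random(seed)
    x, belief, cost = 1.0, prior, 0.0
    for _ in range(steps):
        u = policy(a, x, belief)
        x_next = a * x + b_true * u + rng.gauss(0.0, noise_var ** 0.5)
        # x_next - a*x = b_true*u + w is a noisy measurement of b
        belief = update_belief(belief, u, x_next - a * x, noise_var)
        x = x_next
        cost += x * x
    return cost, belief
```

For example, simulate(cautious_control) and simulate(ce_control) run the two laws on the same noise sequence; results depend on the prior, the noise level and the seed. With a small or wrongly signed estimate b̂, certainty equivalence issues very large inputs, while the cautious law shrinks u_k toward zero as p grows. In the extreme this shrinking can stall learning altogether (the turn-off effect), which is one motivation for controllers that also probe.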

Code Implementations: 1 repo