LGJul 19, 2021

Constrained Policy Gradient Method for Safe and Fast Reinforcement Learning: a Neural Tangent Kernel Based Approach

arXiv:2107.09139v21 citations
Originality Incremental advance
AI Analysis

This work addresses safety and efficiency in reinforcement learning for robotics or control applications, but it is incremental as it applies neural tangent kernels in a new context.

The paper tackles the problem of safe and fast reinforcement learning by introducing a constrained policy gradient algorithm that uses neural tangent kernels to compute policy changes and guide learning with safety constraints, demonstrating improved learning speed and transparency in Cartpole and Lunar Lander environments.

This paper presents a constrained policy gradient algorithm. We introduce constraints for safe learning with the following steps. First, learning is slowed down (lazy learning) so that the episodic policy change can be computed with the help of the policy gradient theorem and the neural tangent kernel. Then, this enables us the evaluation of the policy at arbitrary states too. In the same spirit, learning can be guided, ensuring safety via augmenting episode batches with states where the desired action probabilities are prescribed. Finally, exogenous discounted sum of future rewards (returns) can be computed at these specific state-action pairs such that the policy network satisfies constraints. Computing the returns is based on solving a system of linear equations (equality constraints) or a constrained quadratic program (inequality constraints, regional constraints). Simulation results suggest that adding constraints (external information) to the learning can improve learning in terms of speed and transparency reasonably if constraints are appropriately selected. The efficiency of the constrained learning was demonstrated with a shallow and wide ReLU network in the Cartpole and Lunar Lander OpenAI gym environments. The main novelty of the paper is giving a practical use of the neural tangent kernel in reinforcement learning.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes