LG · ML · May 8, 2019

Smoothing Policies and Safe Policy Gradients

arXiv:1905.03231v2 · 39 citations
AI Analysis

This work addresses safety concerns in real-world control tasks such as robotics, where trial-and-error learning is risky, by providing a method that guarantees, with high probability, that performance does not degrade during training.

The paper tackles the safety issue in policy gradient reinforcement learning by constraining the agent to never worsen performance, establishing improvement guarantees for parametric policies and identifying meta-parameter schedules that ensure monotonic improvement with high probability.

Policy Gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics. However, the trial-and-error nature of these methods poses safety issues whenever the learning process itself must be performed on a physical system or involves any form of human-computer interaction. In this paper, we address a specific safety formulation, where both goals and dangers are encoded in a scalar reward signal and the learning agent is constrained to never worsen its performance, measured as the expected sum of rewards. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimates. Through a joint, adaptive selection of these meta-parameters, we obtain a policy gradient algorithm with monotonic improvement guarantees.
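The interplay between step size and batch size described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm or its constants. It assumes only that the objective J is L-smooth and that the gradient estimate g is within eps of the true gradient with probability at least 1 - delta. Under those assumptions, J(theta + alpha g) - J(theta) >= alpha ||g|| (||g|| - eps) - (L/2) alpha^2 ||g||^2, which is maximized at alpha = (||g|| - eps) / (L ||g||) and is non-negative whenever ||g|| > eps. The names `safe_step`, `chebyshev_error`, and `safe_ascent`, the Chebyshev-style error bound, the batch-doubling rule, and the toy quadratic objective are all hypothetical choices made for illustration.

```python
import numpy as np


def safe_step(grad_est, eps, smoothness):
    """Step size maximizing a guaranteed-improvement lower bound.

    Assumes (not taken from the paper) that J is `smoothness`-smooth and
    that ||grad_est - true_grad|| <= eps. Returns (step, bound); both are
    0.0 when the estimate is too noisy to certify any improvement.
    """
    norm = float(np.linalg.norm(grad_est))
    if norm <= eps:
        return 0.0, 0.0
    step = (norm - eps) / (smoothness * norm)
    bound = (norm - eps) ** 2 / (2.0 * smoothness)
    return step, bound


def chebyshev_error(variance_bound, batch_size, delta):
    """Error radius holding with probability >= 1 - delta (Chebyshev).

    `variance_bound` is an assumed upper bound on E||g_i - grad J||^2
    for a single-sample gradient estimate g_i.
    """
    return float(np.sqrt(variance_bound / (batch_size * delta)))


def safe_ascent(theta, sigma=1.0, delta=0.2, iters=20, seed=0):
    """Toy loop on J(theta) = -0.5 * ||theta||^2 with noisy gradients.

    The batch size is doubled whenever no improvement can be certified;
    this doubling rule is an illustrative choice, not the paper's schedule.
    """
    rng = np.random.default_rng(seed)
    smoothness = 1.0                          # J above is 1-smooth
    variance_bound = theta.size * sigma ** 2  # exact for this toy noise model
    batch_size = 10
    for _ in range(iters):
        noise = sigma * rng.standard_normal((batch_size, theta.size))
        grad_est = (-theta + noise).mean(axis=0)
        eps = chebyshev_error(variance_bound, batch_size, delta)
        step, _ = safe_step(grad_est, eps, smoothness)
        if step == 0.0:
            batch_size *= 2   # estimate too noisy: collect more samples
            continue
        theta = theta + step * grad_est
    return theta, batch_size
```

In this sketch a larger batch shrinks eps, which both enlarges the certified step and raises the guaranteed improvement, while a small batch forces the update to be skipped. Each accepted update is certified only with probability at least 1 - delta, so a guarantee over a whole run would need a correspondingly smaller per-step delta.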
