LG AI SY | Nov 9, 2018

Sample-Efficient Policy Learning based on Completely Behavior Cloning

arXiv:1811.03853v1
Originality: Synthesis-oriented
AI Analysis

This work addresses sample efficiency and safety issues in reinforcement learning for agents, though it appears incremental as it builds on existing MPC methods.

The paper tackles the challenges of direct policy search in reinforcement learning, such as high data requirements and poor local optima, by proposing PLCBC, a policy initialization algorithm that clones an MPC controller without performance loss, enabling faster and better convergence in experiments.

Direct policy search is one of the most important algorithms in reinforcement learning. However, learning from scratch requires a large amount of experience data and is prone to poor local optima. In addition, a partially trained policy tends to take actions that are dangerous to the agent and the environment. To overcome these challenges, this paper proposes a policy initialization algorithm called Policy Learning based on Completely Behavior Cloning (PLCBC). PLCBC first transforms the Model Predictive Control (MPC) controller into a piecewise affine (PWA) function using multi-parametric programming, and then uses a neural network to express this function. In this way, PLCBC can completely clone the MPC controller without any performance loss, and it is entirely training-free. The experiments show that this initialization strategy helps the agent learn in the high-reward region of the state space, and converge faster and to a better policy.
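To make the "training-free cloning" idea concrete, the following is a minimal sketch, not the paper's construction. It assumes a toy PWA control law, a saturated linear feedback u = clip(K·x, U_MIN, U_MAX), which has three affine regions. The gain `K`, the bounds `U_MIN` and `U_MAX`, and all function names are hypothetical and chosen only for illustration. The sketch shows that a network with two ReLU units and hand-set weights reproduces this PWA function exactly, with no training. A real explicit MPC law obtained by multi-parametric programming generally has many more polyhedral regions.

```python
# Toy illustration: a piecewise affine (PWA) control law written as a
# ReLU network whose weights are set by hand (no training involved).
# K, U_MIN and U_MAX are hypothetical values, not taken from the paper.

K = [-0.8, -1.5]          # assumed linear feedback gain
U_MIN, U_MAX = -1.0, 1.0  # assumed input saturation limits


def relu(z):
    return max(0.0, z)


def pwa_controller(x):
    """Reference PWA law: saturated linear feedback (three affine regions)."""
    z = sum(k * xi for k, xi in zip(K, x))
    if z < U_MIN:
        return U_MIN
    if z > U_MAX:
        return U_MAX
    return z


def relu_network(x):
    """One hidden layer with two ReLU units.

    Uses the identity clip(z, lo, hi) = lo + relu(z - lo) - relu(z - hi).
    """
    z = sum(k * xi for k, xi in zip(K, x))  # linear input layer
    h1 = relu(z - U_MIN)                    # hidden unit 1
    h2 = relu(z - U_MAX)                    # hidden unit 2
    return U_MIN + h1 - h2                  # linear output layer


def max_clone_error(n=21, lo=-3.0, hi=3.0):
    """Largest absolute mismatch between the two laws on an n-by-n state grid."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    return max(
        abs(pwa_controller([a, b]) - relu_network([a, b]))
        for a in grid
        for b in grid
    )


if __name__ == "__main__":
    print("max cloning error on grid:", max_clone_error())
```

On the grid, the two functions agree up to floating-point rounding, which is the sense in which a network can clone a PWA controller "without any performance loss" in this toy case.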

