RO · May 6, 2020

Guided Policy Search Model-based Reinforcement Learning for Urban Autonomous Driving

Zhuo Xu, Jianyu Chen, Masayoshi Tomizuka

arXiv:2005.03076v2 · 8.3 · 14 citations

Originality: Incremental advance

AI Analysis

This work addresses the low sample efficiency of learning-based driving policies for autonomous driving, representing an incremental improvement over prior imitation learning and model-free RL methods.

The authors tackle the low sample efficiency of learning-based autonomous driving by introducing a model-based reinforcement learning method built on guided policy search (GPS) for urban driving in the Carla simulator. The method achieves 100x better sample efficiency than the baseline methods and learns policies for harder tasks that the baselines can hardly learn.

In this paper, we continue our prior work on using imitation learning (IL) and model-free reinforcement learning (RL) to learn driving policies for autonomous driving in urban scenarios by introducing a model-based RL method to drive the autonomous vehicle in the Carla urban driving simulator. Although IL and model-free RL methods have been shown to solve many challenging tasks, including playing video games, controlling robots, and, in our prior work, urban driving, their low sample efficiency greatly limits their application to real-world autonomous driving. In this work, we develop a model-based RL algorithm based on guided policy search (GPS) for urban driving tasks. The algorithm iteratively learns a parameterized dynamics model to approximate the complex and interactive driving task, and optimizes the driving policy under this nonlinear approximate dynamics model. As a model-based RL approach applied to urban autonomous driving, GPS offers higher sample efficiency, better interpretability, and greater stability. We provide extensive experiments validating that the proposed method learns robust driving policies for urban driving in Carla. We also compare it with other policy search and model-free RL baselines, showing that the GPS-based method achieves 100x better sample efficiency and can learn policies for harder tasks that the baseline methods can hardly learn.
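The iterative loop the abstract describes (collect samples, fit a dynamics model, optimize a controller under that model, then train the policy from it) can be illustrated with a toy sketch of a simplified GPS-style loop. This is not the authors' implementation. The 1-D speed-tracking task, the linear-drag "simulator" standing in for Carla, the affine least-squares dynamics model, the LQR controller used as the trajectory optimizer, and every name (`guided_policy_search`, `true_step`, `V_TARGET`, and so on) are hypothetical assumptions chosen to keep the example self-contained in standard-library Python. The paper learns a nonlinear dynamics model; here the toy environment is linear, so the fitted model is already exact after the first iteration.

```python
import random

# Hypothetical toy task standing in for Carla: regulate a car's speed.
DT = 0.1         # time step [s] (assumed)
DRAG = 0.2       # linear drag coefficient, unknown to the learner (assumed)
V_TARGET = 10.0  # target speed [m/s] (assumed)


def true_step(v, u):
    """Toy 'simulator': longitudinal speed with linear drag."""
    return v + DT * (u - DRAG * v)


def solve(m, y):
    """Solve m @ x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    a = [list(row) + [y[i]] for i, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (a[r][n] - sum(a[r][c] * x[c] for c in range(r + 1, n))) / a[r][r]
    return x


def least_squares(features, targets):
    """Ordinary least squares via the normal equations X^T X w = X^T y."""
    k = len(features[0])
    xtx = [[sum(f[i] * f[j] for f in features) for j in range(k)] for i in range(k)]
    xty = [sum(f[i] * t for f, t in zip(features, targets)) for i in range(k)]
    return solve(xtx, xty)


def rollout(step, policy, v0, steps, rng=None, explore=0.0):
    """Run a policy through a dynamics function; return (v, u, v_next) tuples."""
    data, v = [], v0
    for _ in range(steps):
        noise = rng.gauss(0.0, explore) if rng is not None else 0.0
        u = policy(v) + noise  # exploration noise keeps the model fit well-posed
        v_next = step(v, u)
        data.append((v, u, v_next))
        v = v_next
    return data


def fit_dynamics(data):
    """Fit the affine model v' = a*v + b*u + c to observed transitions."""
    return least_squares([(v, u, 1.0) for v, u, _ in data], [vn for _, _, vn in data])


def lqr_controller(a, b, c, q=1.0, r=0.1, iters=500):
    """Steady-state LQR under the learned model, regulating e = v - V_TARGET."""
    u_ss = (V_TARGET - a * V_TARGET - c) / b  # control that holds V_TARGET under the model
    p = q
    for _ in range(iters):  # iterate the scalar Riccati recursion to a fixed point
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = a * b * p / (r + b * b * p)
    return lambda v: u_ss - k * (v - V_TARGET)


def linear_policy(th0, th1):
    """Hypothetical policy class: u = th0 + th1 * v."""
    return lambda v: th0 + th1 * v


def guided_policy_search(iterations=3, seed=0):
    rng = random.Random(seed)
    policy, data = linear_policy(0.0, 0.0), []  # initial policy: zero throttle
    for _ in range(iterations):
        # 1) Collect real transitions with the current policy plus exploration noise.
        for v0 in (0.0, 5.0, 15.0):
            data += rollout(true_step, policy, v0, 50, rng, explore=1.0)
        # 2) Fit the dynamics model; 3) optimize a controller under it.
        a, b, c = fit_dynamics(data)
        controller = lqr_controller(a, b, c)
        # 4) Supervised step: regress the policy onto the controller's actions
        #    along trajectories simulated with the learned model only.
        model_step = lambda v, u: a * v + b * u + c
        guide = [s for v0 in (0.0, 5.0, 15.0)
                 for s in rollout(model_step, controller, v0, 50)]
        th0, th1 = least_squares([(1.0, v) for v, _, _ in guide], [u for _, u, _ in guide])
        policy = linear_policy(th0, th1)
    return policy
```

In this sketch, calling `policy = guided_policy_search()` and then repeatedly applying `v = true_step(v, policy(v))` from `v = 0.0` drives the toy speed to `V_TARGET`. Only step 1 touches the "real" simulator; the controller optimization and the policy regression run entirely on the learned model, which is where a model-based method saves samples relative to model-free RL.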

