SY, Mar 2, 2019

Formal Policy Learning from Demonstrations for Reachability Properties

Hadi Ravanbakhsh, Sriram Sankaranarayanan, Sanjit A. Seshia

arXiv:1903.00589, 1 citation, h-index: 72

AI Analysis

For roboticists needing formally verified, efficient controllers from complex MPC demonstrations, this method provides a way to extract simple, verified policies.

This work presents a counterexample-guided iterative framework for learning closed-loop policies from demonstrations that satisfy reachability specifications, using model-predictive controllers as demonstrators. The learned policies are formally verified and achieve orders-of-magnitude speedup over the original MPCs while maintaining correctness.

We consider the problem of learning structured, closed-loop policies (feedback laws) from demonstrations in order to control under-actuated robotic systems, so that formal behavioral specifications such as reaching a target set of states are satisfied. Our approach uses a "counterexample-guided" iterative loop that involves the interaction between a policy learner, a demonstrator, and a verifier. The learner is responsible for querying the demonstrator in order to obtain the training data that guides the construction of a policy candidate. This candidate is analyzed by the verifier and either accepted as correct or rejected with a counterexample. In the latter case, the counterexample is used to update the training data and further refine the policy. The approach is instantiated using receding horizon model-predictive controllers (MPCs) as demonstrators. Rather than using regression to fit a policy to the demonstrator actions, we extend the MPC formulation with the gradient of the cost-to-go function evaluated at sample states in order to constrain the set of policies compatible with the behavior of the demonstrator. We demonstrate the successful application of the resulting policy learning schemes on two case studies, and we show how simple, formally verified policies can be inferred starting from complex and unverified nonlinear MPC implementations. As a further benefit, the policies are many orders of magnitude faster to execute than the original MPCs.
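The learner, demonstrator, and verifier loop described in the abstract can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the scalar single-integrator dynamics, the linear policy u = -k x, the finite grid of candidate gains, the closed-form demonstrator (standing in for an MPC that returns an action and a cost-to-go gradient), the descent fraction `LAMBDA`, and all function names are hypothetical. The verifier here is a simulation-based check over a grid of initial states, which is a falsifier rather than a formal verifier.

```python
from typing import List, Optional, Tuple

DT = 0.1        # integration step (assumed)
HORIZON = 50    # steps allowed to reach the target (assumed)
TARGET = 0.05   # target set is |x| <= TARGET (assumed)
LAMBDA = 0.5    # required fraction of the demonstrator's descent (assumed)

Sample = Tuple[float, float, float]  # (state, demo action, cost-to-go gradient)


def step(x: float, u: float) -> float:
    """Toy single-integrator dynamics: x_next = x + DT * u."""
    return x + DT * u


def policy(k: float, x: float) -> float:
    """Structured candidate policy: linear feedback with gain k."""
    return -k * x


def demonstrator(x: float) -> Tuple[float, float]:
    """Stand-in for an MPC query (hypothetical closed form).

    Returns the action u = -2x and the gradient of a made-up
    cost-to-go V(x) = x^2 at the queried state.
    """
    return -2.0 * x, 2.0 * x


def compatible(k: float, sample: Sample) -> bool:
    """Policy must decrease V at least LAMBDA times as fast as the demo."""
    x, u_demo, grad_v = sample
    return grad_v * policy(k, x) <= LAMBDA * grad_v * u_demo + 1e-12


def learner(candidates: List[float], data: List[Sample]) -> Optional[float]:
    """Return the first candidate gain compatible with every sample."""
    for k in candidates:
        if all(compatible(k, s) for s in data):
            return k
    return None


def verifier(k: float, initial_states: List[float]) -> Optional[float]:
    """Simulate from each initial state; return one that misses the target."""
    for x0 in initial_states:
        x = x0
        for _ in range(HORIZON):
            if abs(x) <= TARGET:
                break
            x = step(x, policy(k, x))
        if abs(x) > TARGET:
            return x0  # counterexample
    return None


def cegis(candidates: List[float], initial_states: List[float],
          max_iters: int = 20) -> Tuple[Optional[float], List[Sample]]:
    """Counterexample-guided loop: propose, check, query, refine."""
    data: List[Sample] = []
    for _ in range(max_iters):
        k = learner(candidates, data)
        if k is None:
            return None, data          # no compatible candidate left
        cex = verifier(k, initial_states)
        if cex is None:
            return k, data             # candidate accepted
        u_demo, grad_v = demonstrator(cex)
        data.append((cex, u_demo, grad_v))
    return None, data


if __name__ == "__main__":
    gains = [0.0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0]
    starts = [i / 10.0 for i in range(-10, 11)]
    gain, samples = cegis(gains, starts)
    print("accepted gain:", gain, "demonstrator queries:", len(samples))
```

In this sketch each counterexample adds one constraint that prunes the candidate set, rather than one more point for a regression fit, which is the distinction the abstract draws. A real instantiation would replace `demonstrator` with an MPC solver and `verifier` with a formal reachability check.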
