OC LG Mar 18, 2014

Simultaneous Perturbation Algorithms for Batch Off-Policy Search

arXiv:1403.4514v2, 5 citations
Originality: Incremental advance
AI Analysis

This work addresses policy search in batch RL for continuous domains, but it is incremental as it builds on existing off-policy evaluation methods and focuses on a simple demonstration.

The authors tackled the problem of off-policy, batch mode reinforcement learning with continuous spaces by proposing novel policy search algorithms, including first-order gradient and second-order Newton methods, and demonstrated their practicality on a simple 1D continuous state space problem.

We propose novel policy search algorithms in the context of off-policy, batch mode reinforcement learning (RL) with continuous state and action spaces. Given a batch collection of trajectories, we perform off-line policy evaluation using an algorithm similar to that of [Fonteneau et al., 2010]. Using this Monte-Carlo-like policy evaluator, we perform policy search in a class of parameterized policies. We propose both first-order policy gradient and second-order policy Newton algorithms. All our algorithms incorporate simultaneous perturbation estimates for the gradient as well as the Hessian of the cost-to-go vector, since the latter is unknown and only biased estimates are available. We demonstrate their practicality on a simple 1-dimensional continuous state space problem.
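To make the idea of a simultaneous perturbation gradient estimate concrete, here is a minimal sketch of a first-order update. It is an illustration under stated assumptions, not the paper's algorithm: `toy_cost` is a hypothetical noise-free quadratic standing in for the batch policy evaluator, the perturbations are Rademacher (±1) variables, and the names `spsa_gradient` and `spsa_descent`, the step size, and the perturbation size `delta` are all invented for this example.

```python
import numpy as np


def spsa_gradient(f, theta, delta=1e-2, rng=None):
    """Two-sided simultaneous perturbation estimate of the gradient of f at theta.

    All coordinates are perturbed at once, so only two evaluations of f
    are needed regardless of the dimension of theta.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Rademacher perturbation direction: each entry is +1 or -1.
    d = rng.choice([-1.0, 1.0], size=theta.shape)
    diff = f(theta + delta * d) - f(theta - delta * d)
    # Element-wise division by the perturbation gives one estimate per coordinate.
    return diff / (2.0 * delta * d)


def toy_cost(theta):
    """Hypothetical stand-in for the policy evaluator: minimized at theta = 1."""
    return float(np.sum((theta - 1.0) ** 2))


def spsa_descent(f, theta0, step=0.05, iters=500, seed=0):
    """Plain gradient descent driven by the estimate above (assumed constant step)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(iters):
        theta -= step * spsa_gradient(f, theta, rng=rng)
    return theta


theta_hat = spsa_descent(toy_cost, np.zeros(3))
```

In the setting the abstract describes, `f` would be replaced by the off-line policy evaluator applied to the policy with parameters `theta`. That evaluator is biased and noisy, so a decaying step size and perturbation size would likely be needed in place of the constants assumed here.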

