Reinforcement Learning with Unbiased Policy Evaluation and Linear Function Approximation
This work addresses the challenge of scaling reinforcement learning to very large MDPs and offers theoretical performance guarantees relevant to practitioners in fields such as robotics. It appears incremental, however: it builds on existing techniques and does not claim a major breakthrough.
The authors address the control of Markov decision processes (MDPs) by analyzing two reinforcement learning algorithms that combine simulation-based policy iteration with lookahead, function approximation, and gradient descent, and they provide performance guarantees for both.
We provide performance guarantees for a variant of simulation-based policy iteration for controlling Markov decision processes. The variant uses stochastic approximation algorithms together with state-of-the-art techniques that are useful for very large MDPs, including lookahead, function approximation, and gradient descent. Specifically, we analyze two algorithms. The first takes a least squares approach: at each iteration, a new set of weights associated with the feature vectors is obtained via least squares minimization. The second is a two-time-scale stochastic approximation algorithm that takes several steps of gradient descent towards the least squares solution before obtaining the next iterate using a stochastic approximation algorithm.
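To make the difference between the two weight-fitting steps concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a feature matrix `Phi` (one row of features per sampled state) and a vector `v_hat` of noisy value estimates for those states; these names, the synthetic data, and the step-size choice are all illustrative assumptions. The lookahead, the simulation-based policy evaluation, and the stochastic approximation update of the iterate are deliberately left out.

```python
import numpy as np


def least_squares_weights(Phi, v_hat):
    """Exact fit: solve min_theta ||Phi @ theta - v_hat||^2 in one shot."""
    theta, *_ = np.linalg.lstsq(Phi, v_hat, rcond=None)
    return theta


def gradient_steps(Phi, v_hat, theta, num_steps, step_size):
    """Approximate fit: take a few gradient steps towards the least squares solution."""
    for _ in range(num_steps):
        # Gradient of 0.5 * ||Phi @ theta - v_hat||^2 with respect to theta.
        grad = Phi.T @ (Phi @ theta - v_hat)
        theta = theta - step_size * grad
    return theta


# Synthetic data (assumption): 50 sampled states, 4 features each.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 4))
true_theta = np.array([1.0, -2.0, 0.5, 3.0])
v_hat = Phi @ true_theta + 0.01 * rng.normal(size=50)

theta_ls = least_squares_weights(Phi, v_hat)

# Step size 1/L, where L is the squared spectral norm of Phi
# (the Lipschitz constant of the gradient above).
step_size = 1.0 / np.linalg.norm(Phi, 2) ** 2
theta_gd = gradient_steps(Phi, v_hat, np.zeros(4), num_steps=500, step_size=step_size)
```

With enough gradient steps the two fits coincide; stopping after only a few steps yields weights that have moved towards, but not reached, the least squares solution, which is the regime the second algorithm operates in.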