LG SY · Oct 16, 2020

Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes

Santiago Paternain, Juan Andres Bazerque, Alejandro Ribeiro

arXiv:2010.08443v1 · 3.33 citations

Originality: Incremental advance

AI Analysis

This work addresses a limitation in reinforcement learning for systems requiring online adaptation to new tasks or environments, though it is incremental as it extends existing policy gradient frameworks to non-stationary settings.

The paper tackles the problem of applying policy gradient methods to continuing tasks in non-stationary Markov decision processes, establishing that stochastic gradients remain ascent directions for the initial value function, enabling convergence to critical points. A numerical example demonstrates the algorithm's ability to learn a navigation and surveillance task with a cyclic trajectory, achieving successful online adaptation without stationarity assumptions.

Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities. In this paper we consider the problem of finding optimal policies assuming that they belong to a reproducing kernel Hilbert space (RKHS). To that end, we compute unbiased stochastic gradients of the value function, which we use as ascent directions to update the policy. A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed. This prevents these algorithms from being fully implemented online, which is a desirable property for systems that need to adapt to new tasks and/or environments in deployment. The main requirement for a policy gradient algorithm to work is that the estimate of the gradient at any point in time is an ascent direction for the initial value function. In this work we establish that this is indeed the case, which allows us to show convergence of the online algorithm to the critical points of the initial value function. A numerical example shows the ability of our online algorithm to learn to solve a navigation and surveillance problem in which an agent must loop between two goal locations. This example corroborates our theoretical findings about the ascent directions of subsequent stochastic gradients. It also shows how the agent running our online algorithm succeeds in learning to navigate, following a continuing cyclic trajectory that does not comply with the standard stationarity assumptions in the literature for non-episodic training.
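
As a rough illustration of the kind of update the abstract describes, the sketch below assumes a Gaussian policy whose mean is a kernel expansion in an RKHS and takes one stochastic ascent step on that mean. It is not the authors' implementation. The class `KernelGaussianPolicy`, the function `gaussian_kernel`, the choice of a Gaussian kernel, and every parameter name are hypothetical. In particular, `q_estimate` is a placeholder for whatever unbiased estimate of the action value an online algorithm would supply.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel k(x, y) between two state vectors (assumed kernel)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-(diff @ diff) / (2.0 * bandwidth ** 2)))


class KernelGaussianPolicy:
    """Hypothetical Gaussian policy with mean h(s) = sum_i w_i k(c_i, s)."""

    def __init__(self, action_dim, sigma=0.5, bandwidth=1.0, seed=0):
        self.action_dim = action_dim
        self.sigma = sigma          # fixed exploration noise (assumption)
        self.bandwidth = bandwidth  # kernel width (assumption)
        self.centers = []           # states c_i where kernels are placed
        self.weights = []           # weight vectors w_i, one per center
        self.rng = np.random.default_rng(seed)

    def mean(self, state):
        """Evaluate the kernel expansion h(state)."""
        h = np.zeros(self.action_dim)
        for c, w in zip(self.centers, self.weights):
            h += w * gaussian_kernel(c, state, self.bandwidth)
        return h

    def sample(self, state):
        """Draw an action a ~ N(h(state), sigma^2 I)."""
        noise = self.rng.standard_normal(self.action_dim)
        return self.mean(state) + self.sigma * noise

    def ascent_step(self, state, action, q_estimate, step_size):
        """One stochastic ascent step on h.

        For a Gaussian policy, the functional gradient of log pi(a|s) with
        respect to h is k(s, .) (a - h(s)) / sigma^2, so the step adds one
        kernel centered at `state` with the weight computed below.
        """
        action = np.asarray(action, dtype=float)
        w = step_size * q_estimate * (action - self.mean(state)) / self.sigma ** 2
        self.centers.append(np.asarray(state, dtype=float))
        self.weights.append(w)


policy = KernelGaussianPolicy(action_dim=1)
policy.ascent_step(state=[0.0], action=[1.0], q_estimate=2.0, step_size=0.1)
print(policy.mean([0.0]))  # mean moved toward the rewarded action
```

In this sketch each step adds a kernel center, so the representation grows with the number of updates; a practical online version would need some form of pruning or sparsification, which is left out here.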
