LG | AI | Jan 31, 2023

Policy Gradient for Rectangular Robust Markov Decision Processes

arXiv:2301.13589v2 | 37 citations | h-index: 81
Originality: Incremental advance
AI Analysis

This paper addresses the computational expense of training robust policies for reinforcement learning agents. The advance is incremental because it builds on existing policy gradient frameworks.

The paper tackles the problem of efficiently learning robust policies in reinforcement learning under transition uncertainty. It introduces a robust policy gradient method whose gradient can be estimated from data with the same time complexity as its non-robust equivalent.

Policy gradient methods have become a standard for training reinforcement learning agents in a scalable and efficient manner. However, they do not account for transition uncertainty, whereas learning robust policies can be computationally expensive. In this paper, we introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs). We provide a closed-form expression for the worst occupation measure. Incidentally, we find that the worst kernel is a rank-one perturbation of the nominal. Combining the worst occupation measure with a robust Q-value estimation yields an explicit form of the robust gradient. Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent. Hence, it relieves the computational burden of convex optimization problems required for training robust policies by current policy gradient approaches.
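
To make the phrase "rank-one perturbation of the nominal" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's closed form: the 3-state chain, the rewards, the radius `beta`, and the direction `k` (taken here as the mean-centred, L2-normalised nominal value function) are all hypothetical choices. It shows that subtracting an outer product `beta k^T` with `sum(k) == 0` keeps every row a probability distribution, and that this particular direction lowers the value of the fixed policy.

```python
def evaluate(P, r, gamma, iters=2000):
    """Iterative policy evaluation: repeat v <- r + gamma * P v."""
    n = len(r)
    v = [0.0] * n
    for _ in range(iters):
        v = [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n))
             for s in range(n)]
    return v


def rank_one_perturb(P, beta, k):
    """Return P - beta k^T, i.e. row s of P shifted by -beta[s] * k."""
    return [[P[s][t] - beta[s] * k[t] for t in range(len(k))]
            for s in range(len(P))]


# Hypothetical state-to-state kernel induced by some fixed policy.
P0 = [[0.7, 0.2, 0.1],
      [0.1, 0.8, 0.1],
      [0.2, 0.3, 0.5]]
r = [1.0, 0.0, 2.0]   # hypothetical rewards
gamma = 0.9

# Value of the policy under the nominal kernel.
v0 = evaluate(P0, r, gamma)

# Assumed perturbation direction: centred, unit-norm nominal value.
# Centring makes sum(k) == 0, so rows of the perturbed kernel still sum to 1.
mean_v = sum(v0) / len(v0)
centred = [x - mean_v for x in v0]
norm = sum(c * c for c in centred) ** 0.5
k = [c / norm for c in centred]

# Assumed per-state radius, small enough to keep all entries non-negative.
beta = [0.05, 0.05, 0.05]

P1 = rank_one_perturb(P0, beta, k)
v1 = evaluate(P1, r, gamma)

print("nominal value  :", v0)
print("perturbed value:", v1)
```

The perturbed kernel moves probability mass from high-value next states to low-value ones, which is why the value drops in every state. In the paper the worst kernel comes from a closed-form expression for rectangular uncertainty sets; here the direction and radius are picked by hand only to show the structure.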
