LG · Jan 28, 2022

Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with Non-stationary Objectives and Constraints

arXiv:2201.11965v418.813 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of ensuring safety in RL for time-varying environments, which is crucial for applications like robotics and autonomous systems, though it builds incrementally on existing primal-dual methods.

The paper tackles the problem of safe reinforcement learning in constrained Markov decision processes with non-stationary objectives and constraints, proposing the PROPD-PPO algorithm that achieves provable efficiency with dynamic regret and constraint violation bounds in linear kernel and tabular settings.

We consider primal-dual-based reinforcement learning (RL) in episodic constrained Markov decision processes (CMDPs) with non-stationary objectives and constraints, which plays a central role in ensuring the safety of RL in time-varying environments. In this problem, the reward/utility functions and the state transition functions are both allowed to vary arbitrarily over time as long as their cumulative variations do not exceed certain known variation budgets. Designing safe RL algorithms in time-varying environments is particularly challenging because of the need to integrate constraint violation reduction, safe exploration, and adaptation to non-stationarity. To this end, we identify two alternative conditions on the time-varying constraints under which we can guarantee safety in the long run. We also propose the Periodically Restarted Optimistic Primal-Dual Proximal Policy Optimization (PROPD-PPO) algorithm, which can accommodate both conditions. Furthermore, a dynamic regret bound and a constraint violation bound are established for the proposed algorithm in both the linear kernel CMDP function approximation setting and the tabular CMDP setting under the two alternative conditions. This paper provides the first provably efficient algorithm for non-stationary CMDPs with safe exploration.
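To make the "periodically restarted primal-dual" idea concrete, here is a minimal sketch of a projected dual-variable update with periodic resets. It is not the PROPD-PPO algorithm from the paper: the function name, the parameters (`threshold`, `step_size`, `restart_period`, `dual_cap`), and the exact update rule are all illustrative assumptions, and the policy (primal) update is omitted entirely.

```python
def run_dual_updates(utility_values, threshold, step_size, restart_period, dual_cap):
    """Projected dual ascent for a constraint 'utility >= threshold'.

    Hypothetical illustration only: the names and update rule are assumptions,
    not the paper's algorithm. Returns the multiplier used in each episode.
    """
    dual = 0.0
    history = []
    for episode, utility in enumerate(utility_values):
        # Periodic restart: discard the multiplier so that information
        # gathered under an outdated environment does not linger.
        if episode % restart_period == 0:
            dual = 0.0
        history.append(dual)
        # Increase the multiplier when the constraint is violated
        # (utility below threshold), decrease it otherwise, then
        # project back onto the interval [0, dual_cap].
        dual = min(max(dual + step_size * (threshold - utility), 0.0), dual_cap)
    return history


if __name__ == "__main__":
    # Estimated utilities for four episodes; restart every three episodes.
    print(run_dual_updates([0.0, 0.0, 1.0, 0.0], 0.5, 1.0, 3, 10.0))
```

In this toy setting the multiplier grows while the constraint is violated, shrinks once it is satisfied, and returns to zero at each restart boundary. A larger multiplier would, in a full primal-dual method, push the policy update toward satisfying the constraint.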
