LG | Nov 10, 2022

Safety-Constrained Policy Transfer with Successor Features

Zeyu Feng, Bowen Zhang, Jianxin Bi, Harold Soh

arXiv:2211.05361v14.68 citations | h-index: 26 | Has Code

Originality: Incremental advance

AI Analysis

This work targets safety-critical applications, such as physical robots interacting with humans, where interactions are costly and unconstrained policies can lead to dangerous outcomes. The contribution appears incremental, as it builds on the existing constrained MDP and successor features frameworks.

The paper addresses safe policy transfer in reinforcement learning with a Constrained Markov Decision Process (CMDP) formulation that separates task goals from safety constraints, so that policies can be transferred while those constraints are respected. In simulated domains, the approach visits unsafe states less frequently and outperforms alternative state-of-the-art methods when safety constraints are taken into account.

In this work, we focus on the problem of safe policy transfer in reinforcement learning: we seek to leverage existing policies when learning a new task with specified constraints. This problem is important for safety-critical applications where interactions are costly and unconstrained policies can lead to undesirable or dangerous outcomes, e.g., with physical robots that interact with humans. We propose a Constrained Markov Decision Process (CMDP) formulation that simultaneously enables the transfer of policies and adherence to safety constraints. Our formulation cleanly separates task goals from safety considerations and permits the specification of a wide variety of constraints. Our approach relies on a novel extension of generalized policy improvement to constrained settings via a Lagrangian formulation. We devise a dual optimization algorithm that estimates the optimal dual variable of a target task, thus enabling safe transfer of policies derived from successor features learned on source tasks. Our experiments in simulated domains show that our approach is effective; it visits unsafe states less frequently and outperforms alternative state-of-the-art methods when taking safety constraints into account.
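The abstract's core idea, generalized policy improvement over source-task successor features combined with a Lagrangian penalty, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a textbook Lagrangian relaxation in which reward and cost are both linear in the same features, and every name here (`lagrangian_gpi_action`, `dual_step`, `w_reward`, `w_cost`, `lam`, and the toy numbers) is hypothetical. The paper's actual dual optimization algorithm may differ in how it estimates the dual variable.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def lagrangian_gpi_action(psi, w_reward, w_cost, lam):
    """Pick an action by GPI on a Lagrangian value (assumed form).

    psi[i][a] is the successor-feature vector of source policy i for
    action a at the current state. The penalized value of an action is
    max over policies of (psi . w_reward - lam * psi . w_cost).
    """
    best_action, best_value = None, float("-inf")
    n_actions = len(psi[0])
    for a in range(n_actions):
        value = max(dot(p[a], w_reward) - lam * dot(p[a], w_cost) for p in psi)
        if value > best_value:
            best_action, best_value = a, value
    return best_action


def dual_step(lam, expected_cost, threshold, lr=0.1):
    """One projected gradient-ascent step on the dual variable.

    lam grows while the estimated cost exceeds the threshold and is
    clipped at zero so it stays a valid Lagrange multiplier.
    """
    return max(0.0, lam + lr * (expected_cost - threshold))


# Toy data (made up): feature 0 tracks task progress, feature 1 tracks
# visits to unsafe states. Two source policies, two actions.
psi = [
    [[1.0, 0.8], [0.6, 0.1]],  # source policy 0
    [[0.9, 0.9], [0.5, 0.0]],  # source policy 1
]
w_reward = [1.0, 0.0]
w_cost = [0.0, 1.0]

unconstrained = lagrangian_gpi_action(psi, w_reward, w_cost, lam=0.0)
constrained = lagrangian_gpi_action(psi, w_reward, w_cost, lam=1.0)
```

With `lam=0.0` the selection reduces to ordinary GPI and favors the high-reward but unsafe action; raising `lam` shifts the choice toward the action with lower expected cost. In a full method, `dual_step` would be applied repeatedly with cost estimates from the target task until the constraint is approximately met.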

