LG Nov 30, 2023

Handling Cost and Constraints with Off-Policy Deep Reinforcement Learning

Jared Markowitz, Jesse Silverberg, Gary Collins

MILA

arXiv:2311.18684v1 2.0 h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses safety-critical applications with cost constraints and offers a more reliable solution for real-world tasks, though the advance is incremental because it builds on existing off-policy frameworks.

The paper studies off-policy deep reinforcement learning in environments with mixed-sign reward functions. It finds that standard methods overemphasize either incentives or costs because value-estimation errors affect the two terms asymmetrically, and it proposes novel actor-critic methods that consistently and significantly outperform state-of-the-art approaches in such settings.

By reusing data throughout training, off-policy deep reinforcement learning algorithms offer improved sample efficiency relative to on-policy approaches. For continuous action spaces, the most popular methods for off-policy learning include policy improvement steps where a learned state-action ($Q$) value function is maximized over selected batches of data. These updates are often paired with regularization to combat associated overestimation of $Q$ values. With an eye toward safety, we revisit this strategy in environments with "mixed-sign" reward functions; that is, with reward functions that include independent positive (incentive) and negative (cost) terms. This setting is common in real-world applications, and may be addressed with or without constraints on the cost terms. We find the combination of function approximation and a term that maximizes $Q$ in the policy update to be problematic in such environments, because systematic errors in value estimation impact the contributions from the competing terms asymmetrically. This results in overemphasis of either incentives or costs and may severely limit learning. We explore two remedies to this issue. First, consistent with prior work, we find that periodic resetting of $Q$ and policy networks can be used to reduce value estimation error and improve learning in this setting. Second, we formulate novel off-policy actor-critic methods for both unconstrained and constrained learning that do not explicitly maximize $Q$ in the policy update. We find that this second approach, when applied to continuous action spaces with mixed-sign rewards, consistently and significantly outperforms state-of-the-art methods augmented by resetting. We further find that our approach produces agents that are both competitive with popular methods overall and more reliably competent on frequently-studied control problems that do not have mixed-sign rewards.
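Two ideas in the abstract can be illustrated with a toy sketch: a "mixed-sign" reward built from a positive incentive term and a negative cost term, and the overestimation that arises when noisy $Q$ estimates are maximized. The code below is not the authors' method and does not reproduce their asymmetry result. All function names, weights, and noise settings are hypothetical and chosen only for illustration. It relies on the general fact that the expected maximum of unbiased noisy estimates is at least the true maximum.

```python
import random
import statistics


def mixed_sign_reward(progress: float, energy_used: float,
                      incentive_weight: float = 1.0,
                      cost_weight: float = 0.5) -> float:
    """Toy mixed-sign reward: a nonnegative incentive minus a nonnegative cost.

    The terms and weights are illustrative assumptions, not from the paper.
    """
    incentive = incentive_weight * max(progress, 0.0)
    cost = cost_weight * max(energy_used, 0.0)
    return incentive - cost


def estimate_maximization_bias(true_q, noise_std=1.0, trials=20000, seed=0):
    """Monte Carlo estimate of E[max_a Q_hat(a)] - max_a Q(a).

    Each Q_hat(a) is the true value plus zero-mean Gaussian noise, standing in
    for function-approximation error. A positive result means that maximizing
    over the noisy estimates overstates the best achievable value.
    """
    rng = random.Random(seed)
    samples = [
        max(q + rng.gauss(0.0, noise_std) for q in true_q)
        for _ in range(trials)
    ]
    return statistics.fmean(samples) - max(true_q)


if __name__ == "__main__":
    # Five actions with identical true value: any measured gap is pure bias.
    print("bias with 5 actions:", estimate_maximization_bias([0.0] * 5))
    # A single action leaves nothing to maximize over, so the bias vanishes
    # up to sampling noise.
    print("bias with 1 action: ", estimate_maximization_bias([0.0]))
    print("toy reward:", mixed_sign_reward(progress=2.0, energy_used=1.0))
```

In this toy setting the bias grows with the number of actions and with the noise level, which is one way to see why $Q$-maximizing policy updates are usually paired with regularization.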

