LGJun 4, 2021

Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

arXiv:2106.02684v3103 citations
Originality Incremental advance
AI Analysis

This work addresses safety-critical applications in reinforcement learning, such as robotics or autonomous systems, by providing theoretical guarantees for constraint satisfaction, representing a significant but incremental advance over prior methods.

The paper tackles the problem of safety in reinforcement learning by proposing algorithms for constrained Markov decision processes that achieve zero or bounded constraint violation with high probability while maintaining a reward regret of order $ ilde{\mathcal{O}}(\sqrt{K})$, improving upon existing $ ilde{\mathcal{O}}(\sqrt{K})$ violation bounds.

We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation in $K$ episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{\mathcal{O}}(\sqrt{K})$. The algorithm which does so employs the principle of optimistic pessimism in the face of uncertainty to achieve safe exploration. When no strictly safe policy is known, though one is known to exist, then it is possible to restrict the system to bounded constraint violation with arbitrarily high probability. This is shown to be realized by a primal-dual algorithm with an optimistic primal estimate and a pessimistic dual update.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes