LG AIApr 17

Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees

Sourav Ganguly, Kartik Pandit, Arnob Ghosh

arXiv:2604.1424349.7h-index: 2

AI Analysis

For safety-critical RL applications, this work provides theoretical guarantees for optimal and safe decision-making in the presence of strategic adversaries, addressing a gap in existing robust RL methods.

This paper studies safe reinforcement learning under adversarial dynamics where an exogenous adversary influences state transitions. The proposed algorithm, RHC-UCRL, achieves sub-linear regret and constraint violation guarantees, marking the first work to address safety-constrained RL with explicit adversarial policies.

Real-world decision-making systems operate in environments where state transitions depend not only on the agent's actions, but also on \textbf{exogenous factors outside its control}--competing agents, environmental disturbances, or strategic adversaries--formally, $s_{h+1} = f(s_h, a_h, \bar{a}_h)+ω_h$ where $\bar{a}_h$ is the adversary/external action, $a_h$ is the agent's action, and $ω_h$ is an additive noise. Ignoring such factors can yield policies that are optimal in isolation but \textbf{fail catastrophically in deployment}, particularly when safety constraints must be satisfied. Standard Constrained MDP formulations assume the agent is the sole driver of state evolution, an assumption that breaks down in safety-critical settings. Existing robust RL approaches address this via distributional robustness over transition kernels, but do not explicitly model the \textbf{strategic interaction} between agent and exogenous factor, and rely on strong assumptions about divergence from a known nominal model. We model the exogenous factor as an \textbf{adversarial policy} $\barπ$ that co-determines state transitions, and ask how an agent can remain both optimal and safe against such an adversary. \emph{To the best of our knowledge, this is the first work to study safety-constrained RL under explicit adversarial dynamics}. We propose \textbf{Robust Hallucinated Constrained Upper-Confidence RL} (\texttt{RHC-UCRL}), a model-based algorithm that maintains optimism over both agent and adversary policies, explicitly separating epistemic from aleatoric uncertainty. \texttt{RHC-UCRL} achieves sub-linear regret and constraint violation guarantees.

View on arXiv PDF

Similar