When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs
For RL security researchers, this work establishes a fundamental theoretical limit on reward poisoning, distinguishing vulnerable from robust instances.
This paper provides the first precise characterization of when reward poisoning attacks can succeed in linear MDPs, drawing a clear boundary between vulnerable and intrinsically robust RL instances. The theory extends to deep RL environments, enabling effective identification and attack of vulnerable cases.
We study reward poisoning attacks in reinforcement learning (RL), where an adversary manipulates rewards within constrained budgets to force the target RL agent to adopt a policy that aligns with the attacker's objectives. Prior works on reward poisoning mainly focused on sufficient conditions to design a successful attacker, while only a few studies discussed the infeasibility of targeted attacks. This paper provides the first precise necessity and sufficiency characterization of the attackability of a linear MDP under reward poisoning attacks. Our characterization draws a bright line between the vulnerable RL instances, and the intrinsically robust ones which cannot be attacked without large costs even running vanilla non-robust RL algorithms. Our theory extends beyond linear MDPs -- by approximating deep RL environments as linear MDPs, we show that our theoretical framework effectively distinguishes the attackability and efficiently attacks the vulnerable ones, demonstrating both the theoretical and practical significance of our characterization.