Reward Poisoning in Reinforcement Learning: Attacks Against Unknown Learners in Unknown Environments
This work addresses security vulnerabilities in RL systems by showing that adversaries can effectively poison rewards even under minimal assumptions, which poses a threat to applications such as autonomous systems and robotics.
The paper tackles black-box reward poisoning attacks in reinforcement learning, where an adversary manipulates rewards to mislead RL agents without prior knowledge of the environment or the learner, and shows that its U2 attack comes close to matching the performance of the state-of-the-art white-box attack.
We study black-box reward poisoning attacks against reinforcement learning (RL), in which an adversary aims to manipulate the rewards to mislead a sequence of RL agents with unknown algorithms into learning a nefarious policy in an environment unknown to the adversary a priori. That is, our attack makes minimal assumptions about the adversary's prior knowledge: the adversary has no initial knowledge of the environment or the learner, and it observes nothing of the learner's internal mechanism beyond the actions the learner performs. We design a novel black-box attack, U2, that can provably achieve performance nearly matching that of the state-of-the-art white-box attack, demonstrating the feasibility of reward poisoning even in the most challenging black-box setting.
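To make the threat model concrete, the following is a minimal toy sketch of reward poisoning in a two-armed bandit. It is not the U2 attack, whose construction is not described here. Every name and number in it (`run_poisoned_bandit`, `true_means`, `penalty`, the epsilon-greedy learner) is a hypothetical assumption chosen for illustration. The adversary sits between the environment and the learner, sees only the chosen action and the reward, and lowers the reward whenever the learner deviates from the target action.

```python
import random


def run_poisoned_bandit(steps=2000, target_action=0, penalty=2.0, seed=0):
    """Toy reward-poisoning loop (illustrative assumption, not U2)."""
    rng = random.Random(seed)
    # Hypothetical environment: action 1 is truly better than action 0.
    true_means = [0.2, 0.8]
    # Hypothetical learner: epsilon-greedy with sample-mean value estimates.
    q = [0.0, 0.0]
    counts = [0, 0]
    epsilon = 0.1
    total_perturbation = 0.0

    for _ in range(steps):
        # Learner picks an action; the adversary never reads q or epsilon.
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = 0 if q[0] >= q[1] else 1

        # Environment emits a noisy reward; the adversary never reads true_means.
        reward = true_means[action] + rng.gauss(0.0, 0.1)

        # Adversary: uses only (action, reward) and penalizes non-target actions.
        poisoned = reward if action == target_action else reward - penalty
        total_perturbation += abs(poisoned - reward)

        # Learner updates its estimate from the poisoned reward.
        counts[action] += 1
        q[action] += (poisoned - q[action]) / counts[action]

    return q, counts, total_perturbation


if __name__ == "__main__":
    q, counts, cost = run_poisoned_bandit()
    print("value estimates:", q)
    print("action counts:", counts)
    print("total reward perturbation:", cost)
```

In this sketch the learner ends up preferring the target action even though the other action has the higher true mean, and the adversary only pays a perturbation cost on the steps where the learner deviates. A fixed penalty like this is a naive strategy that presumes the adversary already knows how large a gap it must overcome; the black-box setting studied in the paper is harder precisely because no such knowledge is available up front.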