LG | AI | Feb 10, 2021

Defense Against Reward Poisoning Attacks in Reinforcement Learning

arXiv:2102.05776v2 | 33 citations
Originality: Incremental advance
AI Analysis

This paper addresses security vulnerabilities in reinforcement learning systems, an incremental but important step toward robust AI applications.

The paper tackles the problem of defending against reward poisoning attacks in reinforcement learning, in which an attacker minimally alters rewards to make a target policy uniquely optimal. It proposes an optimization framework for deriving defense policies with provable performance guarantees: lower bounds on the expected return and upper bounds on suboptimality relative to the attacker's target policy, both measured under the true rewards.

We study defense strategies against reward poisoning attacks in reinforcement learning. As a threat model, we consider attacks that minimally alter rewards to make the attacker's target policy uniquely optimal under the poisoned rewards, with the optimality gap specified by an attack parameter. Our goal is to design agents that are robust against such attacks in terms of the worst-case utility w.r.t. the true, unpoisoned, rewards while computing their policies under the poisoned rewards. We propose an optimization framework for deriving optimal defense policies, both when the attack parameter is known and unknown. Moreover, we show that defense policies that are solutions to the proposed optimization problems have provable performance guarantees. In particular, we provide the following bounds with respect to the true, unpoisoned, rewards: a) lower bounds on the expected return of the defense policies, and b) upper bounds on how suboptimal these defense policies are compared to the attacker's target policy. We conclude the paper by illustrating the intuitions behind our formal results, and showing that the derived bounds are non-trivial.
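
The threat model and the defense objective described in the abstract can be sketched as two optimization problems. The notation here is an assumption introduced for illustration, not taken from the abstract: $\overline{R}$ denotes the true rewards, $\widehat{R}$ the poisoned rewards, $\pi_\dagger$ the attacker's target policy, $\epsilon_\dagger > 0$ the attack parameter, $\rho^{\pi}_{R}$ the expected return of policy $\pi$ under rewards $R$, and $\mathcal{A}$ the attack map. Measuring "minimal" alteration with the $\ell_2$ norm and comparing only against deterministic policies are also assumptions.

```latex
% Attack (sketch): smallest reward change that makes the target policy
% optimal by a margin of at least \epsilon_\dagger.
\widehat{R} \;=\; \mathcal{A}(\overline{R}, \pi_\dagger, \epsilon_\dagger)
\;\in\; \arg\min_{R} \; \lVert R - \overline{R} \rVert_2
\quad \text{s.t.} \quad
\rho^{\pi_\dagger}_{R} \;\ge\; \rho^{\pi}_{R} + \epsilon_\dagger
\quad \forall\, \pi \in \Pi^{\mathrm{det}} \setminus \{\pi_\dagger\}

% Defense (sketch, known attack parameter): maximize the worst-case return
% over all true rewards R that the attack would map to the observed \widehat{R}.
\max_{\pi} \; \min_{R} \; \rho^{\pi}_{R}
\quad \text{s.t.} \quad
\mathcal{A}(R, \pi_\dagger, \epsilon_\dagger) \;=\; \widehat{R}
```

Under this reading, the agent only observes $\widehat{R}$, so the inner minimization ranges over every candidate true reward function consistent with it. The unknown-parameter case mentioned in the abstract would additionally require reasoning over possible values of $\epsilon_\dagger$.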

