AI CR LG Sep 5, 2022

Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents

Stephen Casper, Taylor Killian, Gabriel Kreiman, Dylan Hadfield-Menell

arXiv:2209.02167v36.21 citations h-index: 31 Has Code

Originality: Highly original

AI Analysis

This work addresses security risks in AI systems and is aimed at developers and researchers. It is incremental in that it builds on prior black-box adversarial policy research.

The paper tackles the problem of identifying vulnerabilities in reinforcement learning agents. It introduces white-box adversarial policies that observe the target's internal state, and shows that they perform better against the target than black-box methods in 2-player games and text-generating language models.

Adversarial examples can be useful for identifying vulnerabilities in AI systems before they are deployed. In reinforcement learning (RL), adversarial policies can be developed by training an adversarial agent to minimize a target agent's rewards. Prior work has studied black-box versions of these attacks where the adversary only observes the world state and treats the target agent as any other part of the environment. However, this does not take into account additional structure in the problem. In this work, we study white-box adversarial policies and show that having access to a target agent's internal state can be useful for identifying its vulnerabilities. We make two contributions. (1) We introduce white-box adversarial policies where an attacker observes both a target's internal state and the world state at each timestep. We formulate ways of using these policies to attack agents in 2-player games and text-generating language models. (2) We demonstrate that these policies can achieve higher initial and asymptotic performance against a target agent than black-box controls. Code is available at https://github.com/thestephencasper/lm_white_box_attacks
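The difference between the black-box and white-box settings described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy target network, the function names (`target_forward`, `black_box_observation`, `white_box_observation`, `adversary_reward`), and the choice of hidden activations as the "internal state" are all assumptions made for illustration. The adversary's reward is modeled as the negated target reward, which is one simple way to express "minimize a target agent's rewards".

```python
import math
import random


def target_forward(weights, world_state):
    """Toy target policy (hypothetical): one tanh hidden layer.

    Returns the hidden activations (standing in for the target's
    internal state) and a greedy action index.
    """
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, world_state)))
        for row in weights
    ]
    action = max(range(len(hidden)), key=lambda i: hidden[i])
    return hidden, action


def black_box_observation(world_state):
    """Black-box adversary: sees only the world state."""
    return list(world_state)


def white_box_observation(world_state, target_internal_state):
    """White-box adversary: sees the world state plus the target's
    internal state at the same timestep (here, simply concatenated)."""
    return list(world_state) + list(target_internal_state)


def adversary_reward(target_reward):
    """Assumed zero-sum shaping: the adversary gains what the target loses."""
    return -target_reward


rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
world_state = [0.5, -0.2, 0.1]

hidden, _ = target_forward(weights, world_state)
bb_obs = black_box_observation(world_state)
wb_obs = white_box_observation(world_state, hidden)

print(len(bb_obs), len(wb_obs))  # 3 7
```

In this sketch the white-box observation is strictly a superset of the black-box one, so any adversary trained on it could in principle ignore the extra inputs and match the black-box baseline.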

