AICLCRLGNov 24, 2023

Universal Jailbreak Backdoors from Poisoned Human Feedback

ETH Zurich
arXiv:2311.14455v4135 citationsh-index: 52
Originality Incremental advance
AI Analysis

This addresses a security vulnerability in AI alignment for users of RLHF-trained models, though it is incremental as it builds on prior jailbreak and backdoor research.

The paper tackles the threat of embedding a universal jailbreak backdoor into large language models by poisoning RLHF training data, resulting in a trigger word that enables harmful responses without adversarial search, and finds these backdoors are significantly harder to plant than previous methods.

Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF training data to embed a "jailbreak backdoor" into the model. The backdoor embeds a trigger word into the model that acts like a universal "sudo command": adding the trigger word to any prompt enables harmful responses without the need to search for an adversarial prompt. Universal jailbreak backdoors are much more powerful than previously studied backdoors on language models, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes