Testing the Limits of Jailbreaking Defenses with the Purple Problem
This work highlights a critical limitation in current safety enforcement mechanisms for language models, which is important for researchers and developers focused on AI safety, though it is incremental in testing existing methods rather than proposing new ones.
The paper tackles the problem of evaluating jailbreak defenses in language models by testing them on a simple, well-specified definition of unsafe outputs (containing the word 'purple'), and finds that existing fine-tuning and input defenses fail, casting doubt on their robustness for more complex definitions.
The rise of "jailbreak" attacks on language models has led to a flurry of defenses aimed at preventing undesirable responses. We critically examine the two stages of the defense pipeline: (i) defining what constitutes unsafe outputs, and (ii) enforcing the definition via methods such as input processing or fine-tuning. To test the efficacy of existing enforcement mechanisms, we consider a simple and well-specified definition of unsafe outputs--outputs that contain the word "purple". Surprisingly, existing fine-tuning and input defenses fail on this simple problem, casting doubt on whether enforcement algorithms can be robust for more complicated definitions. We find that real safety benchmarks similarly test enforcement for a fixed definition. We hope that future research can lead to effective/fast enforcement as well as high quality definitions used for enforcement and evaluation.