Using Hallucinations to Bypass GPT4's Filter
The work reveals what the authors describe as a fundamental vulnerability in LLMs, one that could undermine safety measures and poses risks for users who rely on filtered outputs. It is not an incremental result, because the exploit targets a flaw that is currently unaddressed.
The researchers show that the safety filters of large language models such as GPT4 can be bypassed by inducing a hallucination involving reversed text. During the hallucination the model reverts to its pre-RLHF behavior, which effectively erases its filters; the exploit is reported to work on multiple models, including GPT4 and Claude Sonnet.
Large language models (LLMs) are initially trained on vast amounts of data, then fine-tuned using reinforcement learning from human feedback (RLHF); this fine-tuning also serves to teach the LLM to provide appropriate and safe responses. In this paper, we present a novel method to manipulate the fine-tuned version into reverting to its pre-RLHF behavior, effectively erasing the model's filters. The exploit currently works for GPT4, Claude Sonnet, and (to some extent) Inflection-2.5. Unlike other jailbreaks (for example, the popular "Do Anything Now" (DAN)), our method does not rely on instructing the LLM to override its RLHF policy; hence, simply modifying the RLHF process is unlikely to address it. Instead, we induce a hallucination involving reversed text, during which the model reverts to a word bucket, effectively pausing the model's filter. We believe that our exploit exposes a fundamental vulnerability in LLMs that is currently unaddressed, and that it offers an opportunity to better understand the inner workings of LLMs during hallucinations.