[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs
This work addresses fundamental limitations in AI safety that matter to developers and researchers, revealing inherent theoretical constraints on jailbreak detection.
The paper proves two paradoxes about jailbreak detection in foundation models: that a perfect jailbreak classifier is impossible, and that a weaker model cannot consistently detect jailbreaks in a stronger model, with formal proofs and a case study on Llama and GPT-4o.
We introduce two paradoxes concerning the jailbreaking of foundation models. First, it is impossible to construct a perfect jailbreak classifier. Second, a weaker model cannot consistently detect whether a stronger (in a Pareto-dominant sense) model is jailbroken. We provide formal proofs for these paradoxes and a short case study on Llama and GPT-4o to demonstrate them. We discuss the broader theoretical and practical repercussions of these results.
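To make the phrase "stronger in a Pareto-dominant sense" concrete, here is a minimal sketch of the standard Pareto-dominance check applied to per-benchmark scores. It assumes each model is summarized by a score per benchmark where higher is better. The function name `pareto_dominates`, the benchmark names, and the scores are all hypothetical illustrations, not the paper's formal definition or its results.

```python
from typing import Mapping


def pareto_dominates(strong: Mapping[str, float], weak: Mapping[str, float]) -> bool:
    """Return True if `strong` is at least as good as `weak` on every
    benchmark and strictly better on at least one (higher is better)."""
    if strong.keys() != weak.keys():
        raise ValueError("both models must be scored on the same benchmarks")
    at_least_as_good = all(strong[k] >= weak[k] for k in strong)
    strictly_better = any(strong[k] > weak[k] for k in strong)
    return at_least_as_good and strictly_better


# Made-up scores for illustration only; these are not results from the paper.
model_a = {"reasoning": 0.82, "coding": 0.75, "safety": 0.90}
model_b = {"reasoning": 0.70, "coding": 0.75, "safety": 0.85}

if __name__ == "__main__":
    print(pareto_dominates(model_a, model_b))
    print(pareto_dominates(model_b, model_a))
```

Under this reading, a model never dominates itself, and two models that each win on a different benchmark are incomparable, so "stronger" here is a partial order rather than a ranking of all models.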