LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds
LIAR provides a simple, efficient tool for evaluating LLM robustness and advancing alignment research, addressing a critical vulnerability in AI safety.
The paper reframes jailbreak attacks on safety-aligned LLMs as inference-time misalignment and introduces LIAR, a fast, black-box, best-of-N sampling attack that requires no training. LIAR matches state-of-the-art success rates while reducing perplexity by 10x and Time-to-Attack from hours to seconds.
Jailbreak attacks expose vulnerabilities in safety-aligned LLMs by eliciting harmful outputs through carefully crafted prompts. Existing methods rely on discrete optimization or trained adversarial generators, but are slow, compute-intensive, and often impractical. We argue that these inefficiencies stem from a mischaracterization of the problem. Instead, we frame jailbreaks as inference-time misalignment and introduce LIAR (Leveraging Inference-time misAlignment to jailbReak), a fast, black-box, best-of-$N$ sampling attack requiring no training. LIAR matches state-of-the-art success rates while reducing perplexity by $10\times$ and Time-to-Attack from hours to seconds. We also introduce a theoretical "safety net against jailbreaks" metric to quantify safety alignment strength and derive suboptimality bounds. Our work offers a simple yet effective tool for evaluating LLM robustness and advancing alignment research.
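The best-of-$N$ selection step mentioned in the abstract can be illustrated with a minimal, generic sketch. It is not the authors' implementation: the names `best_of_n`, `propose`, `score`, `toy_propose`, and `toy_score` are hypothetical, and the toy proposer and scorer are harmless placeholders for whatever candidate generator and evaluation signal a robustness study would actually use. The sketch only shows the structure: draw $N$ candidates, score each, keep the best, with no training or gradient access.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    propose: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Draw n candidates derived from `prompt` and return the highest-scoring one.

    `propose` and `score` are treated as black boxes (assumption: no gradients
    or model internals are needed, only the ability to sample and to score).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    best_candidate, best_score = "", float("-inf")
    for _ in range(n):
        candidate = propose(prompt, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_candidate, best_score = candidate, candidate_score
    return best_candidate, best_score


# Toy placeholders (hypothetical): append a random 4-letter suffix and
# score a candidate by how many times the letter "a" appears in it.
def toy_propose(prompt: str, rng: random.Random) -> str:
    suffix = "".join(rng.choice("abc") for _ in range(4))
    return f"{prompt} {suffix}"


def toy_score(candidate: str) -> float:
    return float(candidate.count("a"))


if __name__ == "__main__":
    for n in (1, 10, 100):
        candidate, value = best_of_n("q", toy_propose, toy_score, n=n)
        print(n, repr(candidate), value)
```

With a fixed seed, the candidates drawn for a small $N$ are a prefix of those drawn for a larger $N$, so the best score in this sketch never decreases as $N$ grows. Cost is linear in $N$ calls to `propose` and `score`.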