Transferable & Stealthy Ensemble Attacks: A Black-Box Jailbreaking Framework for Large Language Models
This work addresses security vulnerabilities in aligned LLMs, though it appears incremental, since it builds on prior jailbreaking research.
The authors tackle the problem of jailbreaking large language models with a black-box framework that integrates multiple attack strategies; the framework achieved top rankings in the Jailbreaking Attack Track of the Competition for LLM and Agent Safety 2024.
We present a novel black-box jailbreaking framework that integrates multiple LLM-as-Attacker strategies to deliver highly transferable and effective attacks. The framework is grounded in three key insights from prior jailbreaking research and practice: (1) ensemble approaches outperform single methods in exposing the vulnerabilities of aligned LLMs; (2) malicious instructions vary in jailbreaking difficulty and therefore require tailored optimization; and (3) disrupting the semantic coherence of malicious prompts can manipulate their embeddings to boost success rates. Validated in the Competition for LLM and Agent Safety 2024, our solution achieved top rankings in the Jailbreaking Attack Track.