HarmNet: A Framework for Adaptive Multi-Turn Jailbreak Attacks on Large Language Models
HarmNet addresses security risks that matter to LLM developers and users, but the contribution is incremental: it builds on existing jailbreak attack methods.
The paper tackles the vulnerability of large language models to multi-turn jailbreak attacks. It introduces HarmNet, a modular framework that systematically explores the adversarial space, and reports a 99.4% attack success rate on Mistral-7B, 13.9% higher than the best baseline.
Large Language Models (LLMs) remain vulnerable to multi-turn jailbreak attacks. We introduce HarmNet, a modular framework comprising ThoughtNet, a hierarchical semantic network; a feedback-driven Simulator for iterative query refinement; and a Network Traverser for real-time adaptive attack execution. HarmNet systematically explores and refines the adversarial space to uncover stealthy, high-success attack paths. Experiments across closed-source and open-source LLMs show that HarmNet outperforms state-of-the-art methods, achieving higher attack success rates. For example, on Mistral-7B, HarmNet achieves a 99.4% attack success rate, 13.9% higher than the best baseline.
Index terms: jailbreak attacks; large language models; adversarial framework; query refinement.
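The phrase "13.9% higher than the best baseline" can be read in two ways, and the abstract does not say which is meant. The sketch below works through both readings. It assumes that attack success rate is the percentage of attack attempts judged successful; the abstract does not state the judging criterion. The function names `attack_success_rate` and `implied_baseline` are hypothetical, and the baseline values it prints are implied by arithmetic only, not reported in the source.

```python
def attack_success_rate(outcomes):
    """Percentage of attempts judged successful (assumed ASR definition)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return 100.0 * sum(bool(o) for o in outcomes) / len(outcomes)


def implied_baseline(asr, gain, relative=False):
    """Baseline ASR implied by a reported ASR and a reported gain.

    relative=False: gain is in percentage points (baseline = asr - gain).
    relative=True:  gain is a relative increase (baseline = asr / (1 + gain/100)).
    """
    if relative:
        return asr / (1.0 + gain / 100.0)
    return asr - gain


# Figures quoted in the abstract for Mistral-7B.
harmnet_asr = 99.4
gain = 13.9

# Implied best-baseline ASR under each reading (illustrative, not reported).
print(round(implied_baseline(harmnet_asr, gain), 1))                 # 85.5
print(round(implied_baseline(harmnet_asr, gain, relative=True), 1))  # 87.3
```

Under either reading the implied best baseline is in the mid-to-high 80s, so the ambiguity changes the size of the gap but not the overall picture.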