CRAILGApr 15, 2025

Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems

arXiv:2504.11168v317 citationsh-index: 27
Originality Highly original
AI Analysis

This reveals critical security flaws in LLM protection mechanisms, posing risks for users relying on these systems to prevent malicious prompt injections and jailbreaks.

The paper tackled the vulnerability of LLM guardrail systems to evasion attacks, demonstrating that both character injection and adversarial machine learning techniques can bypass detection with up to 100% success rate against six prominent systems.

Large Language Models (LLMs) guardrail systems are designed to protect against prompt injection and jailbreak attacks. However, they remain vulnerable to evasion techniques. We demonstrate two approaches for bypassing LLM prompt injection and jailbreak detection systems via traditional character injection methods and algorithmic Adversarial Machine Learning (AML) evasion techniques. Through testing against six prominent protection systems, including Microsoft's Azure Prompt Shield and Meta's Prompt Guard, we show that both methods can be used to evade detection while maintaining adversarial utility achieving in some instances up to 100% evasion success. Furthermore, we demonstrate that adversaries can enhance Attack Success Rates (ASR) against black-box targets by leveraging word importance ranking computed by offline white-box models. Our findings reveal vulnerabilities within current LLM protection mechanisms and highlight the need for more robust guardrail systems.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes