CR · May 22

Prompt Overflow: What the Guardrail Inspects Is Not What the Model Infers

Yuanbo Zhou, Changjia Zhu, Junyu Wang, Xu He, Yan Zhai, Kun Sun, Mingkui Wei, Junjie Xiong

arXiv:2605.23196 · 66.91 citations

AI Analysis

For security practitioners deploying guardrail models, this work reveals a critical vulnerability that allows prompt injection attacks to evade safety checks, highlighting the need for improved long-input handling.

The paper identifies a blind spot in guardrail models caused by a mismatch between their limited inspection windows and LLMs' larger context windows, and introduces a Prompt Overflow Attack that fragments malicious instructions across overlong prompts to evade detection. The attack bypasses state-of-the-art guardrails such as Meta Llama Prompt Guard and IBM Granite Guardian, while the prompts remain actionable by LLMs.

Guardrail models (a.k.a. safety checkers) are widely deployed to screen user inputs before they reach large language models (LLMs), serving as a primary defense against prompt injection attacks. Due to strict context constraints, these models handle over-length prompts through truncation or segmentation-based inspection. While prior work has focused on semantic adversarial inputs, the security implications of these long-input processing mechanisms remain largely unexplored. In this paper, we identify a critical blind spot arising from the mismatch between the limited inspection windows of guardrail models and the substantially larger inference context windows of downstream LLMs. We introduce a novel Prompt Overflow Attack, which exploits this mismatch by fragmenting malicious instructions and interleaving them with benign filler content across an over-length prompt, such that no individual inspected segment appears malicious while the full context remains actionable to the LLM. Through a systematic evaluation against state-of-the-art guardrail models, including Meta Llama Prompt Guard, IBM Granite Guardian, and DeBERTa-based detectors, we demonstrate that prompts reliably detected in short-context settings can evade guardrail models once adversarially manipulated into over-length inputs, yet remain fully actionable by downstream LLMs. We further propose potential defense strategies and outline mitigation directions to strengthen guardrail models.
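The window mismatch described in the abstract can be illustrated with a small toy model. The sketch below is not the paper's method or any real guardrail's API: `toy_detector`, the placeholder tokens `FLAG_A` and `FLAG_B`, and the window and stride values are all hypothetical, chosen only to show how truncation and non-overlapping segmentation can each miss a pattern that is visible in the full context.

```python
from typing import List


def truncate(tokens: List[str], window: int) -> List[List[str]]:
    """Truncation-based inspection: only the first `window` tokens are seen."""
    return [tokens[:window]]


def segment(tokens: List[str], window: int, stride: int) -> List[List[str]]:
    """Segmentation-based inspection: one chunk of `window` tokens every `stride` tokens."""
    if len(tokens) <= window:
        return [tokens]
    starts = list(range(0, len(tokens) - window + 1, stride))
    if starts[-1] + window < len(tokens):
        starts.append(len(tokens) - window)  # make sure the tail is covered
    return [tokens[s:s + window] for s in starts]


def toy_detector(chunk: List[str]) -> bool:
    """Hypothetical stand-in for a guardrail model.

    Flags a chunk only when both halves of a two-token marker are adjacent in it.
    """
    return any(a == "FLAG_A" and b == "FLAG_B" for a, b in zip(chunk, chunk[1:]))


def guardrail_flags(chunks: List[List[str]]) -> bool:
    """The guardrail flags the prompt if any inspected chunk is flagged."""
    return any(toy_detector(c) for c in chunks)


WINDOW = 8  # assumed inspection window of the toy guardrail, in tokens

# Short prompt: the marker fits inside a single inspection window.
short_prompt = ["FLAG_A", "FLAG_B"]

# Over-length prompt: filler pushes the marker across the 8-token boundary.
prompt = ["pad"] * 7 + ["FLAG_A", "FLAG_B"] + ["pad"] * 7  # 16 tokens

results = {
    "short, truncation": guardrail_flags(truncate(short_prompt, WINDOW)),
    "long, truncation": guardrail_flags(truncate(prompt, WINDOW)),
    "long, segments (no overlap)": guardrail_flags(segment(prompt, WINDOW, WINDOW)),
    "long, segments (50% overlap)": guardrail_flags(segment(prompt, WINDOW, WINDOW // 2)),
    "long, full context": toy_detector(prompt),
}

for name, flagged in results.items():
    print(f"{name}: {'flagged' if flagged else 'missed'}")
```

In this toy setting, overlapping windows recover the marker only because both halves fit inside one window. Fragments separated by more filler than a single window would still be missed by any per-segment check, which is the case the abstract describes, so overlap should be read as a partial mitigation at best.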
