Analysis of LLMs Against Prompt Injection and Jailbreak Attacks
This work addresses security risks for organizations deploying LLMs, but it is incremental as it evaluates existing models and defenses without proposing new solutions.
The paper analyzed prompt injection and jailbreak vulnerabilities across multiple open-source LLMs using a manually curated dataset, finding significant behavioral variations and showing that lightweight inference-time defenses mitigate straightforward attacks but are bypassed by long, reasoning-heavy prompts.
Large Language Models (LLMs) are widely deployed in real-world systems. Given their broader applicability, prompt engineering has become an efficient tool for resource-scarce organizations to adopt LLMs for their own purposes. At the same time, LLMs are vulnerable to prompt-based attacks. Thus, analyzing this risk has become a critical security requirement. This work evaluates prompt-injection and jailbreak vulnerability using a large, manually curated dataset across multiple open-source LLMs, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma variants. We observe significant behavioural variation across models, including refusal responses and complete silent non-responsiveness triggered by internal safety mechanisms. Furthermore, we evaluated several lightweight, inference-time defence mechanisms that operate as filters without any retraining or GPU-intensive fine-tuning. Although these defences mitigate straightforward attacks, they are consistently bypassed by long, reasoning-heavy prompts.