CRMar 23

Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?

arXiv:2510.0524494.58 citationsh-index: 36
AI Analysis

This work addresses security vulnerabilities in AI agents for developers and researchers, though it is incremental as it builds on existing firewall concepts and benchmark analysis.

The paper tackles the vulnerability of AI agents to indirect prompt injection attacks by proposing a simple, modular defense using two firewalls at the agent-tool interface, achieving perfect security with high utility across four public benchmarks while revealing critical limitations in existing benchmarks. It also introduces a three-stage attack strategy to evaluate robustness beyond current attacks.

AI agents are vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external content or tool outputs cause unintended or harmful behavior. Inspired by the well-established concept of firewalls, we show that a simple, modular, and model-agnostic defense operating at the agent--tool interface achieves perfect security with high utility across all four public benchmarks: AgentDojo, Agent Security Bench, InjecAgent and tau-Bench, while achieving a state-of-the-art security--utility tradeoff compared to prior results. Specifically, we employ two firewalls: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer). Unlike prior complex approaches, this defense makes minimal assumptions about the agent and can be deployed out of the box. This makes it highly generalizable while maintaining strong performance without compromising utility. Our analysis also reveals critical limitations in these existing benchmarks, including flawed success metrics, implementation bugs, and most importantly, weak attacks, hindering progress. To address this, we present targeted fixes to these issues for AgentDojo and Agent Security Bench, and propose best practices for more robust benchmark design. Moreover, we introduce a three-stage attack strategy that cascades standard prompt injection attacks, second-order attacks, and adaptive attacks to evaluate the robustness beyond existing attacks. Overall, our work shows that existing agentic security benchmarks are easily saturated by a simple approach and highlights the need for stronger benchmarks with carefully chosen evaluation metrics and strong adaptive attacks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes