CRMay 25

AgentSecBench: Measuring Prompt Injection, Privacy Leakage, and Tool-Use Integrity in LLM Agents

arXiv:2605.2626946.4
Predicted impact top 40% in CR · last 90 daysOriginality Highly original
AI Analysis

Provides a principled security evaluation method for LLM agent systems, addressing the conflation of data flow and authority that leads to prompt injection, privacy leakage, and tool-use integrity violations.

AgentSecBench introduces a formal security framework for LLM agents that defines three security games (instruction-integrity, retrieval-confidentiality, capability-integrity) under a common noninterference notion. Evaluating six defense classes with Qwen3 models shows that risk reduction correlates with channel closure, but model-visible adversarial capabilities can remain exploitable.

LLM agents process trusted instructions, retrieved records, and tool observations through a common generative channel. This conflates data flow with authority: an untrusted string can affect a secret-bearing response or an action proposal even when no application policy authorizes that influence. We introduce AgentSecBench as an empirical instantiation of a formal security framework for this problem. The framework defines three games-instruction-integrity, retrieval-confidentiality, and capability-integrity-under a common notion of intent-to-execution noninterference with permitted leakage. It represents an application policy as a projection onto authorized observations and capabilities, distinguishes prompt annotations from enforcing projections, and measures both adversarial advantage and whether a defense closes the relevant model-visible channel before generation. The exact-marker experiments are intentionally one observable instantiation of the games rather than a complete semantic security claim: they test disclosure and forbidden-action distinguishers with unambiguous ground truth. We evaluate six defense classes with Qwen3-0.6B and Qwen3-1.7B on paired adversarial and benign-control executions. The measurements show when risk reduction follows channel closure and when a model-visible adversarial capability remains exploitable. The result is a security-oriented evaluation method: prompt text can describe a boundary, whereas provenance projection, capability restriction, and output validation can enforce one.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes