Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
This work addresses security vulnerabilities in LLM agents for AI safety researchers and developers, providing a stage-level analysis that reveals critical gaps in existing defenses.
The paper tackled the problem of prompt injection attacks on LLM agents by analyzing at which pipeline stage defenses activate, finding that safety depends on whether adversarial content propagates across stages rather than just being seen. Results showed Claude had 0% attack success rate (ASR) on memory surfaces, GPT-4o-mini had 53% ASR, and DeepSeek varied from 0% to 100% ASR depending on the injection channel.
We present a stage-decomposed analysis of prompt injection attacks against five frontier LLM agents. Prior work measures task-level attack success rate (ASR); we localize the pipeline stage at which each model's defense activates. We instrument every run with a cryptographic canary token (SECRET-[A-F0-9]{8}) tracked through four kill-chain stages -- Exposed, Persisted, Relayed, Executed -- across four attack surfaces and five defense conditions (764 total runs, 428 no-defense attacked). Our central finding is that model safety is determined not by whether adversarial content is seen, but by whether it is propagated across pipeline stages. Concretely: (1) in our evaluation, exposure is 100% for all five models -- the safety gap is entirely downstream; (2) Claude strips injections at write_memory summarization (0/164 ASR), while GPT-4o-mini propagates canaries without loss (53% ASR, 95% CI: 41--65%); (3) DeepSeek exhibits 0% ASR on memory surfaces and 100% ASR on tool-stream surfaces from the same model -- a complete reversal across injection channels; (4) all four active defense conditions (write_filter, pi_detector, spotlighting, and their combination) produce 100% ASR due to threat-model surface mismatch; (5) a Claude relay node decontaminates downstream agents -- 0/40 canaries survived into shared memory.