CR AI Apr 7

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

arXiv:2604.06436 74.92 citations h-index: 6
AI Analysis

This paper addresses the fundamental limitations of prompt injection defenses for language model security, offering incremental theoretical insight into why current wrapper approaches fail.

The paper proves that no continuous, utility-preserving wrapper defense can make all outputs strictly safe for language models with connected prompt spaces, establishing a defense trilemma where continuity, utility preservation, and completeness are incompatible, with results verified in Lean 4 and validated on three LLMs.

We prove that no continuous, utility-preserving wrapper defense (a function $D: X\to X$ that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an $\varepsilon$-robust constraint (under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold); and a persistent unsafe region (under a transversality condition, a positive-measure subset of inputs remains strictly unsafe). These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend to multi-turn interactions, stochastic defenses, and capacity-parity settings. The results do not preclude training-time alignment, architectural changes, or defenses that sacrifice utility. The full theory is mechanically verified in Lean 4 and validated empirically on three LLMs.
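The trilemma can be illustrated on a one-dimensional toy problem. The sketch below is not the paper's construction: the prompt space $X = [0, 1]$, the score function `score`, the threshold `TAU`, and both defenses are hypothetical choices made only for illustration. An input counts as strictly safe when its score is below `TAU`, and a defense preserves utility when it leaves already-safe inputs unchanged. The continuous defense is forced to fix the threshold point, so that point is never made strictly safe; the defense that makes every input strictly safe does so only by jumping at the threshold.

```python
TAU = 0.5  # hypothetical safety threshold (assumption, not from the paper)


def score(x: float) -> float:
    """Hypothetical unsafety score on the toy prompt space X = [0, 1]."""
    return x


def reflect_defense(x: float) -> float:
    """Continuous and utility-preserving: identity on safe inputs,
    mirror image across the threshold for everything else."""
    return x if score(x) < TAU else 2 * TAU - x


def jump_defense(x: float) -> float:
    """Utility-preserving and complete, but discontinuous at the threshold:
    every non-safe input is replaced by the fixed safe input 0."""
    return x if score(x) < TAU else 0.0


def strictly_safe(defense, x: float) -> bool:
    """True if the defended input scores strictly below the threshold."""
    return score(defense(x)) < TAU


def jump_at_threshold(defense, eps: float = 1e-9) -> float:
    """Size of the gap between the defense at TAU and just below TAU."""
    return abs(defense(TAU) - defense(TAU - eps))


# Evaluate both defenses on a grid that contains the threshold point exactly.
xs = [i / 1000 for i in range(1001)]
failures_reflect = [x for x in xs if not strictly_safe(reflect_defense, x)]
failures_jump = [x for x in xs if not strictly_safe(jump_defense, x)]

print("continuous defense fails at:", failures_reflect)
print("discontinuous defense fails at:", failures_jump)
print("gap of continuous defense at TAU:", jump_at_threshold(reflect_defense))
print("gap of discontinuous defense at TAU:", jump_at_threshold(jump_defense))
```

In this toy the continuous defense fails only at the single threshold point, which matches the weakest of the three results (boundary fixation). The positive-measure statements in the abstract rely on the additional Lipschitz and transversality hypotheses, which this sketch does not model.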
