CR AI CL · Aug 12, 2025

Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs

arXiv:2508.09288v2 · 1 citation · h-index: 1

Originality: Highly original

AI Analysis

This work addresses a critical security problem for users of LLMs by offering a provable, drop-in protection method against prompt injection attacks, though it is an incremental improvement over existing heuristic guardrails.

The paper tackles the vulnerability of large language models to prompt injection and jailbreak attacks by introducing Contextual Integrity Verification (CIV), a security architecture that provides deterministic, per-token non-interference guarantees. On the authors' benchmarks, CIV achieves a 0% attack success rate under the stated threat model while preserving 93.1% token-level similarity and showing no degradation in perplexity on benign tasks.

Large language models (LLMs) remain acutely vulnerable to prompt injection and related jailbreak attacks; heuristic guardrails (rules, filters, LLM judges) are routinely bypassed. We present Contextual Integrity Verification (CIV), an inference-time security architecture that attaches cryptographically signed provenance labels to every token and enforces a source-trust lattice inside the transformer via a pre-softmax hard attention mask (with optional FFN/residual gating). CIV provides deterministic, per-token non-interference guarantees on frozen models: lower-trust tokens cannot influence higher-trust representations. On benchmarks derived from recent taxonomies of prompt-injection vectors (Elite-Attack + SoK-246), CIV attains 0% attack success rate under the stated threat model while preserving 93.1% token-level similarity and showing no degradation in model perplexity on benign tasks; we note a latency overhead attributable to a non-optimized data path. Because CIV is a lightweight patch -- no fine-tuning required -- we demonstrate drop-in protection for Llama-3-8B and Mistral-7B. We release a reference implementation, an automated certification harness, and the Elite-Attack corpus to support reproducible research.
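The mechanism described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' reference implementation. The trust levels (`SYSTEM`, `USER`, `TOOL`), the HMAC-SHA256 label format, the function names, and the exact masking rule (a query may attend only to earlier-or-same positions whose trust is at least its own) are all assumptions made for illustration.

```python
import hashlib
import hmac
import math

# Hypothetical trust levels (higher = more trusted); not taken from the paper.
SYSTEM, USER, TOOL = 3, 2, 1


def sign_label(key: bytes, position: int, token: str, trust: int) -> bytes:
    """HMAC-SHA256 tag binding a token to its position and trust level."""
    message = f"{position}|{trust}|{token}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_label(key: bytes, position: int, token: str, trust: int, tag: bytes) -> bool:
    """Constant-time check that a provenance tag matches the claimed label."""
    return hmac.compare_digest(sign_label(key, position, token, trust), tag)


def trust_mask(trust: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed rule: causal (j <= i) and the key is at least as trusted as the query.
    """
    n = len(trust)
    return [[j <= i and trust[j] >= trust[i] for j in range(n)] for i in range(n)]


def masked_attention(queries, keys, values, mask):
    """Single-head scaled dot-product attention with a hard pre-softmax mask."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for i, q in enumerate(queries):
        # Disallowed positions get -inf before the softmax, so their weight is exactly 0.
        scores = [
            sum(a * b for a, b in zip(q, k)) / scale if mask[i][j] else -math.inf
            for j, k in enumerate(keys)
        ]
        peak = max(scores)  # finite, because every position may attend to itself
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))
        ])
    return outputs


# Toy sequence: system prompt, user turn, tool output, user turn.
trust = [SYSTEM, USER, TOOL, USER]
mask = trust_mask(trust)

clean = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
attacked = [row[:] for row in clean]
attacked[2] = [9.0, -9.0]  # only the low-trust tool token is changed

out_clean = masked_attention(clean, clean, clean, mask)
out_attacked = masked_attention(attacked, attacked, attacked, mask)

# Compare the final USER position with and without the modified TOOL token.
print(out_clean[3] == out_attacked[3])
```

In this toy single-layer setting, changing the `TOOL` token alters only that token's own output row, because every higher-trust query assigns it zero attention weight. A real deployment would apply the mask in every layer and head of the frozen model and would verify the signed labels before building the mask.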
