CR · AI · Apr 29, 2025

CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks

arXiv:2504.21228v1 · 6 citations · h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses a security vulnerability in LLMs that affects users relying on them for safe and robust AI interactions, representing an incremental improvement in defense mechanisms.

The paper tackles indirect prompt injection attacks on Large Language Models (LLMs), in which a model deviates from the user's instructions because a task has been injected into the prompt context. It proposes CachePrune, a defense that identifies and prunes task-triggering neurons in the KV cache to reduce attack success rates without compromising response quality.

Large Language Models (LLMs) are susceptible to indirect prompt injection attacks, where the model undesirably deviates from user-provided instructions by executing tasks injected in the prompt context. This vulnerability stems from LLMs' inability to distinguish between data and instructions within a prompt. In this paper, we propose CachePrune, which defends against this attack by identifying and pruning task-triggering neurons from the KV cache of the input prompt context. By pruning such neurons, we encourage the LLM to treat the text spans of the input prompt context as pure data rather than as any indicator of instruction following. These neurons are identified via feature attribution with a loss function induced from an upper bound of the Direct Preference Optimization (DPO) objective. We show that such a loss function enables effective feature attribution with only a few samples. We further improve the quality of feature attribution by exploiting an observed triggering effect in instruction following. Our approach does not impose any formatting on the original prompt or introduce extra test-time LLM calls. Experiments show that CachePrune significantly reduces attack success rates without compromising the response quality. Note: This paper aims to defend against indirect prompt injection attacks, with the goal of developing more secure and robust AI systems.
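
The general idea of attribution-based pruning described in the abstract can be illustrated with a small toy sketch. This is not the paper's implementation: the loss below is a stand-in (a softplus over a linear score), not the DPO-derived upper bound, and the gradient-times-input attribution, the summation over tokens, the pruning ratio, and all function names are assumptions made purely for illustration. The cached features are modeled as a plain NumPy array of shape (tokens, neurons).

```python
import numpy as np

def toy_loss(cache, w):
    """Stand-in loss (assumption): softplus of a linear score over the cache.
    Higher values play the role of 'more likely to follow the injected task'."""
    z = cache.sum(axis=0) @ w
    return np.logaddexp(0.0, z)  # numerically stable softplus

def toy_loss_grad(cache, w):
    """Analytic gradient of toy_loss with respect to every cache entry."""
    z = cache.sum(axis=0) @ w
    sig = 1.0 / (1.0 + np.exp(-z))
    # Each token row receives the same gradient sig * w.
    return np.broadcast_to(sig * w, cache.shape)

def attribution_scores(cache, grad):
    """Gradient-times-input attribution, summed over tokens per neuron."""
    return (cache * grad).sum(axis=0)

def prune_mask(scores, ratio):
    """Zero out at most `ratio` of the neurons, chosen by highest positive score."""
    k = int(round(ratio * scores.size))
    mask = np.ones_like(scores)
    if k == 0:
        return mask
    top = np.argsort(scores)[-k:]   # indices of the k largest scores
    top = top[scores[top] > 0]      # only prune neurons that raise the loss
    mask[top] = 0.0
    return mask

rng = np.random.default_rng(0)
cache = rng.normal(size=(8, 16))    # hypothetical cache: 8 context tokens, 16 neurons
w = rng.normal(size=16)             # hypothetical weights of the stand-in loss

scores = attribution_scores(cache, toy_loss_grad(cache, w))
mask = prune_mask(scores, ratio=0.25)
pruned_cache = cache * mask         # the mask is applied to context tokens only

print("loss before:", float(toy_loss(cache, w)))
print("loss after: ", float(toy_loss(pruned_cache, w)))
```

In this toy setting the mask only touches features of the context tokens and requires no change to the prompt text, which mirrors the abstract's claim that the defense imposes no formatting on the prompt and adds no extra test-time LLM calls. How the paper actually computes and thresholds its attributions is not stated in this summary.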

