CR | AI | May 8, 2025

Defending against Indirect Prompt Injection by Instruction Detection

arXiv:2505.06311v2 | 19 citations | h-index: 11 | Has Code | EMNLP
Originality: Incremental advance
AI Analysis

The paper addresses a security vulnerability affecting users of LLMs integrated with external data, though it is an incremental improvement over existing defense mechanisms.

The paper tackles Indirect Prompt Injection (IPI) attacks on Large Language Models integrated with external sources, such as Retrieval-Augmented Generation. It proposes InstructDetector, a detection-based approach that achieves 99.60% in-domain and 96.90% out-of-domain detection accuracy and reduces the attack success rate to 0.03% on the BIPIA benchmark.
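To make the two reported metrics concrete, here is a minimal sketch of how detection accuracy and attack success rate could be computed from per-sample outcomes. The `Trial` record and its fields are hypothetical, and the sketch assumes that content flagged by the detector is blocked before it reaches the LLM, so only missed injections can succeed. The paper's exact evaluation protocol on BIPIA may differ.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One evaluation sample (hypothetical record, not from the paper)."""
    has_injection: bool     # ground truth: external content carries an instruction
    flagged: bool           # detector decision
    attack_succeeded: bool  # would the undefended LLM follow the injection?


def detection_accuracy(trials):
    """Fraction of samples where the detector decision matches ground truth."""
    correct = sum(t.flagged == t.has_injection for t in trials)
    return correct / len(trials)


def attack_success_rate(trials):
    """Fraction of attack samples that both evade detection and succeed.

    Assumption: flagged content is blocked, so a flagged attack never succeeds.
    """
    attacks = [t for t in trials if t.has_injection]
    successes = sum((not t.flagged) and t.attack_succeeded for t in attacks)
    return successes / len(attacks)


# Tiny illustrative data set (made up, not results from the paper).
trials = [
    Trial(has_injection=True, flagged=True, attack_succeeded=True),     # caught
    Trial(has_injection=True, flagged=False, attack_succeeded=True),    # missed
    Trial(has_injection=False, flagged=False, attack_succeeded=False),  # clean, passed
    Trial(has_injection=False, flagged=True, attack_succeeded=False),   # false positive
]

accuracy = detection_accuracy(trials)
asr = attack_success_rate(trials)
```

In this toy data, half of the decisions are correct and one of the two attacks gets through, so both values are 0.5.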

The integration of Large Language Models (LLMs) with external sources is becoming increasingly common, with Retrieval-Augmented Generation (RAG) being a prominent example. However, this integration introduces vulnerability to Indirect Prompt Injection (IPI) attacks, where hidden instructions embedded in external data can manipulate LLMs into executing unintended or harmful actions. We recognize that IPI attacks fundamentally rely on the presence of instructions embedded within external content, which can alter the behavioral states of LLMs. Can the effective detection of such state changes help us defend against IPI attacks? In this paper, we propose InstructDetector, a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks. Specifically, we demonstrate that the hidden states and gradients from intermediate layers provide highly discriminative features for instruction detection. By effectively combining these features, InstructDetector achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, and reduces the attack success rate to just 0.03% on the BIPIA benchmark. The code is publicly available at https://github.com/MYVAE/Instruction-detection.
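The idea of combining hidden-state and gradient features can be illustrated with a toy sketch. This is not the authors' implementation: the single `tanh` layer standing in for an intermediate LLM layer, the squared-error loss used to obtain a gradient, the synthetic inputs, and the logistic-regression probe are all assumptions made for illustration. The actual method is in the linked repository.

```python
import math
import random

random.seed(0)
DIM, HIDDEN = 6, 4

# Toy stand-in for one intermediate layer of an LLM (hypothetical weights).
W = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(HIDDEN)]
V = [random.uniform(-1, 1) for _ in range(HIDDEN)]


def hidden_state(x):
    """Intermediate activation h = tanh(W x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def hidden_gradient(h, target=1.0):
    """Gradient of 0.5 * (V.h - target)^2 with respect to the pre-activation W x."""
    err = sum(v * hi for v, hi in zip(V, h)) - target
    return [err * v * (1.0 - hi * hi) for v, hi in zip(V, h)]


def features(x):
    """Concatenate hidden-state and gradient features into one vector."""
    h = hidden_state(x)
    return h + hidden_gradient(h)


def predict(weights, bias, feats):
    """Probability that the input carries an instruction (logistic probe)."""
    z = bias + sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, bias, data):
    """Mean binary cross-entropy of the probe on (features, label) pairs."""
    eps = 1e-12
    total = 0.0
    for feats, label in data:
        p = predict(weights, bias, feats)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(data)


def train(data, lr=0.02, epochs=200):
    """Full-batch gradient descent for the logistic-regression probe."""
    n = len(data[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for feats, label in data:
            diff = predict(weights, bias, feats) - label
            grad_b += diff
            for j, f in enumerate(feats):
                grad_w[j] += diff * f
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias


def sample(label):
    """Synthetic input: 'instruction-bearing' samples (label 1) are shifted."""
    shift = 1.0 if label else -1.0
    return [random.gauss(shift, 0.5) for _ in range(DIM)]


# Balanced synthetic training set of (combined features, label) pairs.
data = [(features(sample(lbl)), lbl) for lbl in [0, 1] * 50]
weights, bias = train(data)
```

Calling `predict(weights, bias, features(x))` on a new input `x` then yields a score that could be thresholded to flag suspected injected instructions. Concatenation is only one simple way to combine the two feature types.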

Code Implementations: 1 repo
