CR · AI · Nov 28, 2024

RevPRAG: Revealing Poisoning Attacks in Retrieval-Augmented Generation through LLM Activation Analysis

arXiv:2411.18948v5 · 21 citations · h-index: 3 · EMNLP
Originality: Highly original
AI Analysis

The paper addresses a security vulnerability in RAG systems and offers a novel detection method for a previously underexplored attack surface.

The paper tackles the detection of poisoning attacks in Retrieval-Augmented Generation (RAG) systems, where malicious texts injected into the knowledge database lead to incorrect responses. It introduces RevPRAG, a detection pipeline based on LLM activations that reports a 98% true positive rate with a false positive rate close to 1%.

Retrieval-Augmented Generation (RAG) enriches the input to LLMs by retrieving information from the relevant knowledge database, enabling them to produce responses that are more accurate and contextually appropriate. It is worth noting that the knowledge database, being sourced from publicly available channels such as Wikipedia, inevitably introduces a new attack surface. RAG poisoning involves injecting malicious texts into the knowledge database, ultimately leading to the generation of the attacker's target response (also called poisoned response). However, there are currently limited methods available for detecting such poisoning attacks. We aim to bridge the gap in this work. Particularly, we introduce RevPRAG, a flexible and automated detection pipeline that leverages the activations of LLMs for poisoned response detection. Our investigation uncovers distinct patterns in LLMs' activations when generating correct responses versus poisoned responses. Our results on multiple benchmark datasets and RAG architectures show our approach could achieve 98% true positive rate, while maintaining false positive rates close to 1%.
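To make the idea of activation-based detection and the reported metrics concrete, here is a minimal sketch. It is not the RevPRAG model: the abstract does not describe the detector's architecture, so this uses a nearest-centroid classifier as a stand-in, and the "activations" are synthetic Gaussian vectors rather than real LLM hidden states. All names (`CentroidDetector`, `synthetic_activations`, `tpr_fpr`) are hypothetical and chosen for illustration only.

```python
import random


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidDetector:
    """Stand-in detector (an assumption, not the paper's model).

    Labels: 0 = correct response, 1 = poisoned response.
    """

    def fit(self, activations, labels):
        clean = [a for a, y in zip(activations, labels) if y == 0]
        poisoned = [a for a, y in zip(activations, labels) if y == 1]
        self.clean_centroid = mean_vector(clean)
        self.poisoned_centroid = mean_vector(poisoned)
        return self

    def predict(self, activation):
        # Flag as poisoned when the vector is closer to the poisoned centroid.
        d_poisoned = sq_dist(activation, self.poisoned_centroid)
        d_clean = sq_dist(activation, self.clean_centroid)
        return int(d_poisoned < d_clean)


def tpr_fpr(labels, preds):
    """True positive rate and false positive rate for binary labels."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def synthetic_activations(n, shift, dim=8, seed=0):
    """Synthetic placeholder for activation vectors (not real LLM data)."""
    rng = random.Random(seed)
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]
```

A typical use would fit the detector on labelled activation vectors, call `predict` on held-out vectors, and pass the results to `tpr_fpr`. A true positive rate of 98% means 98 of every 100 poisoned responses are flagged; a false positive rate near 1% means about 1 of every 100 correct responses is flagged by mistake.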
