Privacy-Preserving Retrieval-Augmented Generation with Differential Privacy
This work addresses privacy risks for users who handle sensitive external data in RAG systems, though the contribution is incremental: it builds on existing DP and RAG methods.
The paper tackles privacy leakage in retrieval-augmented generation (RAG) for large language models by applying differential privacy (DP). It proposes an algorithm that spends privacy budget only on the tokens that need sensitive information. The results show that the algorithm outperforms non-RAG baselines under a privacy budget of ε≈10 across various models and datasets.
With the recent remarkable advancement of large language models (LLMs), there has been growing interest in using them in domains with highly sensitive data that lies outside their training data. For this purpose, retrieval-augmented generation (RAG) is particularly effective: it assists LLMs by directly providing relevant information from external knowledge sources. However, without extra privacy safeguards, RAG outputs risk leaking sensitive information from the external data source. In this work, we explore RAG under differential privacy (DP), a formal guarantee of data privacy. The main challenge with differentially private RAG is how to generate long, accurate answers within a moderate privacy budget. We address this by proposing an algorithm that spends privacy budget only on the tokens that require the sensitive information and uses the non-private LLM for the other tokens. Our extensive empirical evaluations reveal that our algorithm outperforms the non-RAG baseline under a reasonable privacy budget of $\epsilon \approx 10$ across different models and datasets.
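The budget argument in the abstract can be illustrated with a toy accounting sketch. This is not the paper's algorithm. It assumes basic sequential composition (per-token costs simply add up), a fixed hypothetical cost `epsilon_per_token`, and a hypothetical list of flags saying which tokens need the sensitive data. In a real DP mechanism that decision would itself depend on the private data and would have to be made privately and accounted for; the sketch ignores this and only shows why selective spending allows longer answers under the same total budget.

```python
from typing import List, Tuple


def tokens_generated(
    needs_private: List[bool],
    total_epsilon: float,
    epsilon_per_token: float,
    selective: bool,
) -> Tuple[int, float]:
    """Count how many tokens can be emitted before the budget runs out.

    needs_private[i] is a hypothetical flag: True if token i needs the
    sensitive retrieved data. With selective=False every token is charged;
    with selective=True only flagged tokens are charged and the rest are
    assumed to come from a non-private LLM at zero privacy cost.
    Costs add up under basic sequential composition (an assumption here).
    """
    spent = 0.0
    emitted = 0
    for flag in needs_private:
        charge = epsilon_per_token if (flag or not selective) else 0.0
        # Stop as soon as the next token would exceed the total budget.
        if spent + charge > total_epsilon + 1e-12:
            break
        spent += charge
        emitted += 1
    return emitted, spent


# Hypothetical 100-token answer in which every fifth token needs private data.
flags = [i % 5 == 0 for i in range(100)]
dense = tokens_generated(flags, total_epsilon=10.0, epsilon_per_token=0.5, selective=False)
sparse = tokens_generated(flags, total_epsilon=10.0, epsilon_per_token=0.5, selective=True)
print(dense)   # (20, 10.0): budget exhausted after 20 tokens
print(sparse)  # (100, 10.0): all 100 tokens emitted with the same budget
```

With these illustrative numbers, charging every token stops the answer after 20 tokens, while charging only the 20 flagged tokens covers the full 100-token answer at the same total budget of 10.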