CL AI · Aug 26, 2025

Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework

Ilias Driouich, Hongliu Cao, Eoin Thomas

arXiv:2508.18929v1 · 1 citation · h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses the need for more comprehensive and ethically aligned evaluation datasets for RAG systems, which developers and researchers rely on to ensure system effectiveness and trustworthiness. The contribution is incremental in that it focuses on dataset generation rather than broader evaluation methods.

The paper tackles the problem of evaluating retrieval-augmented generation (RAG) systems by introducing a multi-agent framework to generate synthetic QA datasets that prioritize semantic diversity and privacy preservation, achieving improved diversity and robust privacy masking on domain-specific datasets.

Retrieval-augmented generation (RAG) systems improve large language model outputs by incorporating external knowledge, enabling more informed and context-aware responses. However, the effectiveness and trustworthiness of these systems critically depend on how they are evaluated, particularly on whether the evaluation process captures real-world constraints like protecting sensitive information. While current evaluation efforts for RAG systems have primarily focused on the development of performance metrics, far less attention has been given to the design and quality of the underlying evaluation datasets, despite their pivotal role in enabling meaningful, reliable assessments. In this work, we introduce a novel multi-agent framework for generating synthetic QA datasets for RAG evaluation that prioritize semantic diversity and privacy preservation. Our approach involves: (1) a Diversity agent that leverages clustering techniques to maximize topical coverage and semantic variability, (2) a Privacy agent that detects and masks sensitive information across multiple domains, and (3) a QA curation agent that synthesizes private and diverse QA pairs suitable as ground truth for RAG evaluation. Extensive experiments demonstrate that our evaluation sets outperform baseline methods in diversity and achieve robust privacy masking on domain-specific datasets. This work offers a practical and ethically aligned pathway toward safer, more comprehensive RAG system evaluation, laying the foundation for future enhancements aligned with evolving AI regulations and compliance standards.
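
The first two stages of the pipeline described in the abstract can be illustrated with a small, standard-library-only Python sketch. It is not the authors' implementation: greedy farthest-point selection over bag-of-words vectors stands in for the clustering-based Diversity agent, two crude regular expressions stand in for the Privacy agent, and the QA curation step (which would call an LLM) is omitted. All function names and patterns here are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately crude patterns. The phone pattern will also
# match other long digit runs (for example dates); a real privacy agent
# would need far broader and more careful entity coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def mask_pii(text):
    """Replace each matched span with a typed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def bag_of_words(text):
    """Lowercased word counts, used as a very rough semantic vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_diverse(chunks, k):
    """Greedy farthest-point selection (a stand-in for clustering).

    Starts from the first chunk, then repeatedly adds the chunk whose
    highest similarity to any already-chosen chunk is lowest.
    """
    if not chunks or k <= 0:
        return []
    vectors = [bag_of_words(c) for c in chunks]
    chosen = [0]
    while len(chosen) < min(k, len(chunks)):
        candidates = [i for i in range(len(chunks)) if i not in chosen]
        best = min(
            candidates,
            key=lambda i: max(cosine(vectors[i], vectors[j]) for j in chosen),
        )
        chosen.append(best)
    return [chunks[i] for i in chosen]


def build_eval_contexts(chunks, k):
    """Pick k diverse chunks, then mask sensitive spans in each."""
    return [mask_pii(c) for c in select_diverse(chunks, k)]
```

In this sketch, the masked contexts returned by `build_eval_contexts` would be the input to a QA generation step. Masking after selection keeps the similarity computation on the original text, which is a design choice of the sketch rather than something the abstract specifies.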
