AI | HC | Oct 1, 2025

Data Quality Challenges in Retrieval-Augmented Generation

arXiv:2510.00552v1 | 1 citation | h-index: 25 | ICIS
Originality: Synthesis-oriented
AI Analysis

This paper addresses data quality challenges for organizations adopting RAG systems. Its contribution is incremental: it extends existing data quality frameworks rather than introducing a new paradigm.

The study addresses the inadequacy of existing data quality frameworks for dynamic Retrieval-Augmented Generation systems. From interviews with practitioners, it derives 15 new data quality dimensions across the RAG processing stages and concludes that quality management needs to be front-loaded and step-aware.

Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to enhance Large Language Models with enterprise-specific knowledge. However, current data quality (DQ) frameworks were developed primarily for static datasets and inadequately address the dynamic, multi-stage nature of RAG systems. This study aims to develop DQ dimensions for this new type of AI-based system. We conduct 16 semi-structured interviews with practitioners from leading IT service companies. Through a qualitative content analysis, we inductively derive 15 distinct DQ dimensions across the four processing stages of RAG systems: data extraction, data transformation, prompt & search, and generation. Our findings reveal that (1) new dimensions have to be added to traditional DQ frameworks to cover RAG contexts; (2) these new dimensions are concentrated in the early RAG steps, suggesting the need for front-loaded quality management strategies; and (3) DQ issues transform and propagate through the RAG pipeline, necessitating a dynamic, step-aware approach to quality management.
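
The third finding, that DQ issues propagate through the pipeline, can be illustrated with a small sketch. This is not the authors' method. The four stage names come from the abstract, but the `Issue` and `StageResult` types, the `run_step_aware` function, and the toy transforms and checks are all hypothetical and chosen only for illustration. The abstract does not name the 15 dimensions, so none are encoded here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Stage names are taken from the abstract; everything else is hypothetical.
STAGES = ("data extraction", "data transformation", "prompt & search", "generation")


@dataclass
class Issue:
    origin_stage: str  # stage at which the issue was first detected
    message: str


@dataclass
class StageResult:
    stage: str
    output: str
    new_issues: List[Issue]
    inherited_issues: List[Issue]  # issues raised by earlier stages


def run_step_aware(
    payload: str,
    transforms: Dict[str, Callable[[str], str]],
    checks: Dict[str, Callable[[str], Optional[str]]],
) -> List[StageResult]:
    """Run each stage, check its output, and carry detected issues forward."""
    results: List[StageResult] = []
    carried: List[Issue] = []
    for stage in STAGES:
        payload = transforms[stage](payload)
        message = checks[stage](payload)
        new = [Issue(stage, message)] if message else []
        results.append(StageResult(stage, payload, new, list(carried)))
        carried.extend(new)
    return results


# Toy stages: extraction loses the text, later stages pass it along unchecked.
transforms = {
    "data extraction": lambda raw: "",  # e.g. a scan with no text layer
    "data transformation": lambda text: text[:200],
    "prompt & search": lambda chunk: chunk,
    "generation": lambda context: "Answer based on: " + context,
}
no_check = lambda output: None
checks = {
    "data extraction": lambda out: None if out.strip() else "empty text after extraction",
    "data transformation": no_check,
    "prompt & search": no_check,
    "generation": no_check,
}

results = run_step_aware("<scanned PDF bytes>", transforms, checks)
for r in results:
    print(r.stage, [i.message for i in r.new_issues],
          [i.origin_stage for i in r.inherited_issues])
```

In this toy run, the only check that fires is at the extraction stage, yet the generation stage still reports the issue as inherited. Tracking where an issue originated, rather than only where it surfaces, is one way to read the abstract's call for front-loaded, step-aware quality management.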

