AI · Sep 30, 2025

Human-Centered Evaluation of RAG outputs: a framework and questionnaire for human-AI collaboration

arXiv:2509.26205v1 · 1 citations · h-index: 6
Originality: Synthesis-oriented
AI Analysis

This work addresses the evaluation of RAG outputs for developers and researchers. Its contribution is incremental, since it extends an existing framework.

The paper addresses the lack of systematic human-centered evaluation for retrieval-augmented generation (RAG) systems by developing a questionnaire that covers 12 dimensions. It finds that LLMs reliably focus on metric descriptions but struggle to detect textual format variations, while human raters had difficulty adhering strictly to metric descriptions and labels.

Retrieval-augmented generation (RAG) systems are increasingly deployed in user-facing applications, yet systematic, human-centered evaluation of their outputs remains underexplored. Building on Gienapp's utility-dimension framework, we designed a human-centered questionnaire that assesses RAG outputs across 12 dimensions. We iteratively refined the questionnaire through several rounds of ratings on a set of query-output pairs and semantic discussions. Ultimately, we incorporated feedback from both a human rater and a human-LLM pair. Results indicate that while large language models (LLMs) reliably focus on metric descriptions and scale labels, they exhibit weaknesses in detecting textual format variations. Humans struggled to focus strictly on metric descriptions and labels. LLM ratings and explanations were viewed as a helpful support, but numeric LLM and human ratings lacked agreement. The final questionnaire extends the initial framework by focusing on user intent, text structuring, and information verifiability.
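The abstract reports that numeric LLM and human ratings lacked agreement but does not say here how agreement was measured. As an illustration only, the sketch below computes quadratic-weighted Cohen's kappa, one common choice for two raters on an ordinal scale. The 1-to-5 scale, the function name `weighted_kappa`, and the sample ratings are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def weighted_kappa(ratings_a, ratings_b, min_rating=1, max_rating=5):
    """Quadratic-weighted Cohen's kappa for two raters on an ordinal scale.

    Assumed setup (not from the paper): integer ratings between
    min_rating and max_rating, one rating per item from each rater.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    if max_rating <= min_rating:
        raise ValueError("scale needs at least two points")
    for r in list(ratings_a) + list(ratings_b):
        if not (min_rating <= r <= max_rating):
            raise ValueError(f"rating {r} is outside the scale")

    n = len(ratings_a)
    span = max_rating - min_rating          # largest possible disagreement
    observed = Counter(zip(ratings_a, ratings_b))  # joint rating counts
    hist_a = Counter(ratings_a)             # marginal counts, rater A
    hist_b = Counter(ratings_b)             # marginal counts, rater B

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            weight = ((i - j) ** 2) / (span ** 2)  # quadratic penalty
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * hist_a[i] * hist_b[j] / (n * n)

    if expected_disagreement == 0:
        # Both raters gave the same single value everywhere; kappa is
        # undefined in that case, and 1.0 is returned here by convention.
        return 1.0
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical ratings for one RAG output across 12 dimensions.
human = [4, 3, 5, 2, 4, 4, 3, 5, 1, 3, 4, 2]
llm = [5, 3, 4, 4, 4, 2, 3, 5, 3, 2, 5, 3]
print(f"weighted kappa: {weighted_kappa(human, llm):.2f}")
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement. Other measures, such as Krippendorff's alpha or a rank correlation, would be equally reasonable choices.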

