RAGAPHENE: A RAG Annotation Platform with Human Enhancements and Edits
This work addresses the need for high-quality benchmarks for evaluating LLMs in RAG settings, which matters to researchers and developers aiming to reduce hallucinations and ensure factual correctness. The contribution is incremental, as it builds on existing annotation and benchmarking methods.
The authors address the problem of evaluating Large Language Models (LLMs) in multi-turn Retrieval Augmented Generation (RAG) conversations by developing RAGAPHENE, a chat-based annotation platform that lets annotators simulate real-world dialogues. Approximately 40 annotators have used the platform to create thousands of conversations.
Retrieval Augmented Generation (RAG) is an important aspect of conversing with Large Language Models (LLMs) when factual correctness matters. LLMs may provide answers that appear correct but contain hallucinated information. Thus, building benchmarks that can evaluate LLMs on multi-turn RAG conversations has become an increasingly important task. Simulating real-world conversations is vital for producing high-quality evaluation benchmarks. We present RAGAPHENE, a chat-based annotation platform that enables annotators to simulate real-world conversations for benchmarking and evaluating LLMs. RAGAPHENE has been successfully used by approximately 40 annotators to build thousands of real-world conversations.