Long$^2$RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall
This work addresses a gap in evaluation methods for RAG systems and is most relevant to researchers and developers working on long-context and long-form generation tasks. The contribution is incremental, since it builds on existing RAG benchmarking efforts.
The authors address the lack of adequate benchmarks for evaluating retrieval-augmented generation (RAG) systems in long-context, long-form scenarios. They introduce the Long$^2$RAG benchmark, with 280 questions across 10 domains, and the Key Point Recall (KPR) metric, which measures how well LLMs incorporate key points from the retrieved documents, which average 2,444 words in length.
Retrieval-augmented generation (RAG) is a promising approach to address the limitations of fixed knowledge in large language models (LLMs). However, current benchmarks for evaluating RAG systems suffer from two key deficiencies: (1) they fail to adequately measure LLMs' capability in handling long-context retrieval due to a lack of datasets that reflect the characteristics of retrieved documents, and (2) they lack a comprehensive evaluation method for assessing LLMs' ability to generate long-form responses that effectively exploit retrieved information. To address these shortcomings, we introduce the Long$^2$RAG benchmark and the Key Point Recall (KPR) metric. Long$^2$RAG comprises 280 questions spanning 10 domains and 8 question categories, each associated with 5 retrieved documents with an average length of 2,444 words. KPR evaluates the extent to which LLMs incorporate key points extracted from the retrieved documents into their generated responses, providing a more nuanced assessment of their ability to exploit retrieved information.
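
To make the idea behind KPR concrete, here is a minimal sketch of a recall-style score over key points. It rests on assumptions that the text above does not state: that the score for one question is the fraction of its key points covered by the response, and that the benchmark score is the mean of those fractions over questions. The function names (`key_point_recall`, `toy_entails`) are hypothetical, and the substring check is only a stand-in for whatever judge decides that a response covers a key point.

```python
from typing import Callable, Sequence


def toy_entails(response: str, key_point: str) -> bool:
    """Hypothetical stand-in judge: case-insensitive substring match.

    A real evaluation would need a far stronger judgment of whether the
    response covers the key point; this exists only to make the sketch run.
    """
    return key_point.lower() in response.lower()


def key_point_recall(
    responses: Sequence[str],
    key_points: Sequence[Sequence[str]],
    entails: Callable[[str, str], bool] = toy_entails,
) -> float:
    """Mean, over questions, of the fraction of key points the response covers.

    responses[i] is the generated answer for question i, and key_points[i]
    holds the key points extracted from that question's retrieved documents.
    Questions with no key points are skipped (an assumption of this sketch).
    """
    if len(responses) != len(key_points):
        raise ValueError("responses and key_points must have the same length")

    per_question = []
    for response, points in zip(responses, key_points):
        if not points:
            continue
        covered = sum(1 for point in points if entails(response, point))
        per_question.append(covered / len(points))

    return sum(per_question) / len(per_question) if per_question else 0.0


# Illustrative data only, not drawn from the benchmark.
responses = [
    "RAG retrieves documents and grounds answers in them.",
    "Long contexts are hard.",
]
key_points = [
    ["retrieves documents", "grounds answers", "reduces hallucination"],
    ["long contexts"],
]
score = key_point_recall(responses, key_points)
```

In this toy case the first response covers two of its three key points and the second covers its single key point, so the score is the mean of 2/3 and 1. Passing the judge in as a parameter keeps the averaging logic separate from the harder question of how coverage is decided.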