CL | AI | Oct 17, 2024

MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

arXiv:2410.13716v2 | 18 citations | h-index: 16 | Has Code | NAACL
Originality: Incremental advance
AI Analysis

MIRAGE-Bench provides a more efficient evaluation method for multilingual RAG systems, though the advance is incremental because it combines existing benchmark approaches.

The authors address the problem of evaluating multilingual retrieval-augmented generation (RAG) systems with MIRAGE-Bench, a synthetic arena-based benchmark covering 18 languages. A surrogate judge trained on heuristic metrics predicts LLM judgments and achieves a high correlation (Kendall Tau = 0.909) with GPT-4o as the teacher.

Traditional retrieval-augmented generation (RAG) benchmarks evaluate systems using heuristic-based metrics, but these require human preferences as the ground truth for reference. In contrast, arena-based benchmarks, where systems compete against each other, require an expensive large language model (LLM) as a judge for a reliable evaluation. We present a simple, efficient technique to combine the best of both worlds. The idea is to train a surrogate judge that takes heuristic metrics as input and predicts the output of the LLM as a judge. In our work, we develop MIRAGE-Bench, a synthetic arena-based RAG benchmark for 18 diverse languages on Wikipedia, focused on multilingual answer generation evaluation. It extensively couples both heuristic features and LLM as a judge for evaluation. We benchmark 19 multilingual LLMs and observe a high correlation (Kendall Tau ($τ$) = 0.909) between our surrogate judge and GPT-4o as a teacher using the Bradley-Terry framework. Our results show that proprietary and large open-source LLMs currently dominate on MIRAGE-Bench. Our code and datasets are made publicly available here: https://github.com/vectara/mirage-bench.
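The abstract names two standard tools, the Bradley-Terry framework for turning pairwise arena outcomes into per-system scores and Kendall Tau for comparing two rankings. The following is a minimal sketch of both under stated assumptions, not the authors' implementation: the win counts, the surrogate scores, and the function names `fit_bradley_terry` and `kendall_tau` are all made up for illustration, and the fitting routine is a plain minorization-maximization loop chosen for brevity.

```python
import math
from itertools import combinations


def fit_bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths with simple MM updates.

    wins[i][j] is the number of times system i beat system j.
    Returns one log-strength per system (higher means stronger).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        # Rescale so the strengths keep a fixed overall scale.
        scale = n / sum(new_p)
        p = [x * scale for x in new_p]
    # Guard against log(0) for a system that never wins.
    return [math.log(max(x, 1e-12)) for x in p]


def kendall_tau(a, b):
    """Kendall tau-a between two score lists of equal length (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical pairwise outcomes from a teacher judge for three systems.
teacher_wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
teacher_scores = fit_bradley_terry(teacher_wins)

# Hypothetical scores a surrogate judge might predict for the same systems.
surrogate_scores = [1.1, 0.2, -0.9]

tau = kendall_tau(teacher_scores, surrogate_scores)
print("teacher log-strengths:", [round(s, 3) for s in teacher_scores])
print("Kendall tau:", tau)
```

In this toy setup the two rankings agree on every pair of systems, so the coefficient is at its maximum. A surrogate judge in the sense of the abstract would produce `surrogate_scores` from heuristic metric features rather than from hand-picked numbers.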

Code Implementations: 1 repo
