MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks
This work addresses the need for rigorous evaluation of LLM benchmarks, particularly in high-stakes domains like cybersecurity. Its contribution is incremental, however, since it applies existing meta-evaluation concepts to a specific area.
The paper tackles the lack of standardized meta-evaluation for QA benchmarks by proposing MEQA, a framework that provides quantifiable scores and comparisons. The framework is demonstrated on cybersecurity benchmarks, with human and LLM evaluators, to highlight the benchmarks' strengths and weaknesses.
As Large Language Models (LLMs) advance, their potential for widespread societal impact grows with them. Hence, rigorous LLM evaluations are both a technical necessity and a social imperative. While numerous evaluation benchmarks have been developed, there remains a critical gap in meta-evaluation: effectively assessing the quality of the benchmarks themselves. We propose MEQA, a framework for the meta-evaluation of question and answer (QA) benchmarks, to provide standardized assessments and quantifiable scores and to enable meaningful intra-benchmark comparisons. We demonstrate this approach on cybersecurity benchmarks, using human and LLM evaluators, highlighting the benchmarks' strengths and weaknesses. We motivate our choice of test domain by AI models' dual nature as both powerful defensive tools and security threats.