CL AI Nov 26, 2025

Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices

Paulo Cavalin, Cassia Sanctos, Marcelo Grave, Claudio Pinhanez, Yago Primerano

arXiv:2511.21860v1 2.71 citations

Originality: Incremental advance

AI Analysis

This work addresses the reliability of LLM scores on multiple-choice benchmarks; it is an incremental advance, since it builds on existing consistency evaluation methods.

The authors address unreliable scores on multiple-choice benchmarks for Large Language Models by introducing the Consistency-Rebalanced Accuracy (CoRA) metric, which adjusts scores according to response consistency measured on synthetically generated questions. They show that CoRA scales down the scores of inconsistent models.

In this work we present the Consistency-Rebalanced Accuracy (CoRA) metric, which improves the reliability of Large Language Model (LLM) scores computed on multiple-choice (MC) benchmarks. Our metric explores the response consistency of LLMs, taking advantage of synthetically generated questions with altered answer choices. Using two intermediate scores, i.e., Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), CoRA is computed by adjusting the multiple-choice question answering (MCQA) scores to better reflect the level of consistency of the LLM. We present evaluations on different benchmarks using diverse LLMs, and demonstrate not only that LLMs can exhibit low response consistency even when they achieve high MCQA scores, but also that CoRA can successfully scale down the scores of inconsistent models.
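The gap between a high MCQA score and low response consistency can be illustrated with a minimal sketch. This listing does not give the formulas for BMCA, CI, or CoRA, so the code below does not implement them. It only contrasts plain accuracy on the original questions with a stricter, hypothetical score that credits a question only when the model is also correct on every altered-choice variant. The function names, the data layout, and the toy results are all assumptions made for illustration.

```python
from typing import List

# Each inner list holds correctness flags for one question:
# index 0 is the original question, the rest are altered-choice variants.
# (Hypothetical layout, not taken from the paper.)
Results = List[List[bool]]


def plain_accuracy(results: Results) -> float:
    """Fraction of questions answered correctly in their original form."""
    return sum(flags[0] for flags in results) / len(results)


def strict_accuracy(results: Results) -> float:
    """Fraction of questions answered correctly on the original AND all variants.

    This is an illustrative stand-in for a consistency-aware score,
    not the paper's BMCA, CI, or CoRA.
    """
    return sum(all(flags) for flags in results) / len(results)


# Toy data: the model gets 3 of 4 originals right,
# but only 1 question survives every altered variant.
toy_results: Results = [
    [True, True, True],
    [True, False, True],
    [True, True, False],
    [False, False, False],
]

print(plain_accuracy(toy_results))   # 0.75
print(strict_accuracy(toy_results))  # 0.25
```

On this toy data the plain score is 0.75 while the strict score is 0.25, which is the kind of discrepancy a consistency-adjusted metric is meant to expose.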
