CL · AI · Nov 26, 2025

Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices

arXiv:2511.21860v1
Originality: Incremental advance
AI Analysis

This work addresses the reliability of LLM scores on multiple-choice benchmarks. It is an incremental advance because it builds on existing consistency evaluation methods.

The authors tackle the problem of unreliable scores in multiple-choice benchmarks for Large Language Models by introducing the Consistency-Rebalanced Accuracy (CoRA) metric, which adjusts scores based on response consistency measured on synthetically generated questions. They show that CoRA scales down the scores of inconsistent models.

In this work, we present the Consistency-Rebalanced Accuracy (CoRA) metric, which improves the reliability of Large Language Model (LLM) scores computed on multiple-choice (MC) benchmarks. Our metric explores the response consistency of LLMs, taking advantage of synthetically generated questions with altered answer choices. Using two intermediate scores, i.e., Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), CoRA is computed by adjusting the multiple-choice question answering (MCQA) scores to better reflect the level of consistency of the LLM. We present evaluations on different benchmarks using diverse LLMs, and demonstrate not only that LLMs can exhibit low response consistency even when they achieve high MCQA scores, but also that CoRA can successfully scale down the scores of inconsistent models.
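The general idea of checking response consistency with altered answer choices can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' method: the abstract does not give the formulas for BMCA, CI, or CoRA, so none of them is reproduced here. The alteration used (rotating the order of the answer choices), the names `rotated_variants`, `evaluate`, `strict_accuracy`, and `consistency_ratio`, and the two toy models are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (prompt, answer choices, index of the correct choice)
Question = Tuple[str, List[str], int]
# A model maps (prompt, choices) to the index of the choice it selects.
Model = Callable[[str, Sequence[str]], int]


def rotated_variants(choices: Sequence[str], answer_idx: int) -> List[Tuple[List[str], int]]:
    """Return every non-trivial rotation of the choices with the new answer index.

    ASSUMPTION: rotation stands in for "altered answer choices"; the paper's
    actual alteration procedure may differ.
    """
    n = len(choices)
    variants = []
    for k in range(1, n):
        rotated = list(choices[k:]) + list(choices[:k])
        variants.append((rotated, (answer_idx - k) % n))
    return variants


def evaluate(model: Model, questions: Sequence[Question]) -> Dict[str, float]:
    """Compute plain accuracy and a stricter, consistency-aware accuracy.

    ASSUMPTION: these are illustrative quantities, not BMCA, CI, or CoRA.
    """
    plain_hits = 0
    strict_hits = 0
    for prompt, choices, answer_idx in questions:
        original_ok = model(prompt, choices) == answer_idx
        plain_hits += original_ok
        # Count a question as strictly correct only if the model is also
        # correct on every altered variant of it.
        variants_ok = all(
            model(prompt, variant) == idx
            for variant, idx in rotated_variants(choices, answer_idx)
        )
        strict_hits += original_ok and variants_ok
    n = len(questions)
    accuracy = plain_hits / n
    strict_accuracy = strict_hits / n
    ratio = strict_accuracy / accuracy if accuracy else 0.0
    return {
        "accuracy": accuracy,
        "strict_accuracy": strict_accuracy,
        "consistency_ratio": ratio,
    }


# Hypothetical toy data and models, for demonstration only.
QUESTIONS: List[Question] = [
    ("2 + 2 = ?", ["4", "3", "5", "6"], 0),
    ("Capital of France?", ["Paris", "Rome", "Oslo", "Bern"], 0),
]
ANSWER_KEY = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}


def position_biased_model(prompt: str, choices: Sequence[str]) -> int:
    """Always picks the first choice, regardless of content."""
    return 0


def content_model(prompt: str, choices: Sequence[str]) -> int:
    """Picks the choice whose text matches the answer key."""
    return list(choices).index(ANSWER_KEY[prompt])
```

On this toy data, `position_biased_model` reaches a plain accuracy of 1.0 because every correct answer happens to sit in the first position, yet its strict accuracy is 0.0 once the choices are rotated. `content_model` scores 1.0 on both. This is the kind of gap between a high MCQA score and low response consistency that the abstract describes.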

