AI, Jul 21, 2025

Metric assessment protocol in the context of answer fluctuation on MCQ tasks

Ekaterina Goliakova, Xavier Renard, Marie-Jeanne Lesot, Thibault Laugel, Christophe Marsala, Marcin Detyniecki

arXiv:2507.15581v1, 5.81 citations, h-index: 11

Originality: Incremental advance

AI Analysis

This work addresses the need for reliable evaluation protocols in AI benchmarking, though it is incremental as it builds on existing metrics without introducing a new paradigm.

The paper evaluates LLMs on multiple-choice questions by assessing existing metrics against answer fluctuation, where a model gives different answers after slight prompt changes. It finds that worst accuracy has the highest association with fluctuation rates.

Using multiple-choice questions (MCQs) has become a standard for assessing LLM capabilities efficiently. A variety of metrics can be employed for this task. However, previous research has not conducted a thorough assessment of these metrics. At the same time, MCQ evaluation suffers from answer fluctuation: models produce different results given slight changes in prompts. We suggest a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates, as well as original performance. Our results show that there is a strong link between existing metrics and answer fluctuation, even when the metrics are computed without any additional prompt variants. A novel metric, worst accuracy, demonstrates the highest association under the protocol.
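
To make the quantities in the abstract concrete, here is a minimal Python sketch of how a fluctuation rate and a worst-case accuracy could be computed from a model's answers across prompt variants. The abstract does not give the formal definitions, so everything here is an assumption: the function names, the data layout (one list of predicted letters per question, first entry from the original prompt), and in particular the reading of "worst accuracy" as "correct under every variant". It should not be taken as the authors' implementation.

```python
from typing import Sequence


def original_accuracy(answers: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Accuracy using only the first (original-prompt) answer per question."""
    hits = sum(1 for per_q, g in zip(answers, gold) if per_q[0] == g)
    return hits / len(gold)


def fluctuation_rate(answers: Sequence[Sequence[str]]) -> float:
    """Share of questions whose answer is not identical across all variants."""
    changed = sum(1 for per_q in answers if len(set(per_q)) > 1)
    return changed / len(answers)


def worst_accuracy(answers: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """ASSUMED definition: a question counts only if every variant is correct."""
    hits = sum(1 for per_q, g in zip(answers, gold) if all(a == g for a in per_q))
    return hits / len(gold)


# Hypothetical toy data: 4 questions, 3 prompt variants each.
answers = [
    ["A", "A", "A"],  # stable and correct
    ["B", "C", "B"],  # fluctuates
    ["D", "D", "D"],  # stable but wrong
    ["A", "B", "A"],  # fluctuates
]
gold = ["A", "B", "C", "A"]

print(original_accuracy(answers, gold))  # 0.75
print(fluctuation_rate(answers))         # 0.5
print(worst_accuracy(answers, gold))     # 0.25
```

In this toy data the original accuracy (0.75) overstates how often the model is reliably right, while the worst-case figure (0.25) drops for exactly the questions that fluctuate. Note that this sketch needs extra prompt variants, whereas the abstract states that the metrics remain linked to fluctuation even when computed without them, so the paper's formulation likely differs.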
