CL | AI | Feb 2, 2024

LLMs May Perform MCQA by Selecting the Least Incorrect Option

arXiv:2402.01349v3 | 40 citations | h-index: 13 | COLING
AI Analysis

This paper addresses a reliability issue in LLM evaluation that matters to researchers and practitioners, though the contribution is incremental: it builds on prior concerns about variability.

The paper identifies that LLMs may answer multiple-choice questions by selecting the least incorrect option rather than the correct one, potentially undermining MCQA as an evaluation metric, and introduces MCQA+, an enhanced dataset augmentation method to improve assessment accuracy.

In the field of NLP, Large Language Models (LLMs) have markedly enhanced performance across a variety of tasks. However, the comprehensive evaluation of LLMs remains an inevitable challenge for the community. Recently, the adoption of Multiple Choice Question Answering (MCQA) as a benchmark for assessing LLMs has gained considerable traction. However, concerns regarding the robustness of this evaluative method persist. Building upon previous discussions on the issue of variability, we reveal an additional dimension of concern: LLMs may perform MCQA by selecting the least incorrect option rather than a distinctly correct one. This observation suggests that LLMs might regard multiple options as correct, which could undermine the reliability of MCQA as a metric for evaluating LLMs. To address this challenge, we introduce an enhanced dataset augmentation method for MCQA, termed MCQA+, to provide a more accurate reflection of model performance, thereby highlighting the necessity for more sophisticated evaluation mechanisms in the assessment of LLM capabilities.

