CL · Feb 19, 2024

Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?

CMU
arXiv:2402.12483v2 · 71 citations · h-index: 27 · ACL
Originality: Incremental advance
AI Analysis

This work examines whether MCQA benchmarks evaluate LLMs fairly and argues for stronger baselines and more robust datasets. The contribution is incremental.

The study investigated whether large language models (LLMs) can answer multiple-choice questions from the answer choices alone, without the question. Across three MCQA datasets and four LLMs, choices-only prompting outperformed a majority baseline in 11 out of 12 cases, with accuracy gains of up to 0.33.

Multiple-choice question answering (MCQA) is often used to evaluate large language models (LLMs). To see if MCQA assesses LLMs as intended, we probe if LLMs can perform MCQA with choices-only prompts, where models must select the correct answer only from the choices. In three MCQA datasets and four LLMs, this prompt bests a majority baseline in 11/12 cases, with up to 0.33 accuracy gain. To help explain this behavior, we conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference. Our key findings are threefold. First, we find no evidence that the choices-only accuracy stems from memorization alone. Second, priors over individual choices do not fully explain choices-only accuracy, hinting that LLMs use the group dynamics of choices. Third, LLMs have some ability to infer a relevant question from choices, and surprisingly can sometimes even match the original question. Inferring the original question is an impressive reasoning strategy, but it cannot fully explain the high choices-only accuracy of LLMs in MCQA. Thus, while LLMs are not fully incapable of reasoning in MCQA, we still advocate for the use of stronger baselines in MCQA benchmarks, the design of robust MCQA datasets for fair evaluations, and further efforts to explain LLM decision-making.
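
To make the comparison in the abstract concrete, here is a minimal sketch of how a choices-only prompt and a majority baseline might be set up. It assumes the majority baseline means always predicting the most frequent gold label; the prompt layout, the function names, and the toy labels are hypothetical illustrations, not the paper's actual prompts or data.

```python
from collections import Counter

LETTERS = "ABCDEFGHIJ"

def choices_only_prompt(choices):
    """Format the answer choices with no question (hypothetical layout)."""
    lines = [f"({LETTERS[i]}) {choice}" for i, choice in enumerate(choices)]
    return "\n".join(lines) + "\nAnswer:"

def majority_baseline_accuracy(gold_labels):
    """Accuracy of always predicting the most frequent gold label (assumed definition)."""
    most_common_count = Counter(gold_labels).most_common(1)[0][1]
    return most_common_count / len(gold_labels)

def accuracy(predictions, gold_labels):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

# Toy labels for illustration only; these are not results from the paper.
gold = ["A", "B", "A", "C"]
preds = ["A", "B", "A", "A"]  # stand-in for model outputs on choices-only prompts

gain = accuracy(preds, gold) - majority_baseline_accuracy(gold)
```

The quantity `gain` corresponds to the kind of accuracy gain over the majority baseline that the abstract reports; a positive value means the choices-only predictions beat always guessing the most common label.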

Code Implementations: 1 repo