CL | Feb 5

BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks

arXiv:2602.06221v1 | 1 citation | h-index: 25
Originality: Incremental advance
AI Analysis

This work addresses benchmark reliability issues in NLP evaluation, offering a practical tool for researchers, though it is incremental in applying education-inspired methods to existing problems.

The authors tackled the problem of quality flaws in multiple-choice question answering benchmarks by developing BenchMarker, a toolkit that uses LLM judges to detect contamination, shortcuts, and writing errors; validation on 12 benchmarks revealed that contaminated items inflate accuracy while writing errors lower it and affect rankings.

Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit that uses LLM judges to flag three common MCQ flaws: 1) contamination: items that appear verbatim online; 2) shortcuts: cues in the choices that enable guessing; and 3) writing errors: structural or grammatical issues, judged against a 19-rule education rubric. We validate BenchMarker with human annotations, then run the tool to audit 12 benchmarks, revealing: 2) contaminated MCQs tend to inflate accuracy, while writing errors tend to lower it and change rankings more than random variation would; and 3) prior benchmark repairs address their targeted issues (e.g., lowering accuracy with LLM-written distractors) but inadvertently add new flaws (e.g., implausible distractors, multiple correct answers). Overall, flaws in MCQs degrade NLP evaluation, but education research offers a path forward. We release BenchMarker to bridge the two fields and improve MCQA benchmark design.
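
To make the "shortcuts" flaw concrete, here is a minimal sketch of one choices-only cue, the longest-answer heuristic. This is not BenchMarker's method, which the abstract describes as relying on LLM judges. The function names, the item schema (a "choices" list and an "answer" index), and the toy items are all hypothetical and exist only for illustration. If a guesser that never reads the question beats chance by a wide margin on a real benchmark, the choices may be leaking the answer.

```python
from typing import Sequence


def longest_choice_accuracy(items: Sequence[dict]) -> float:
    """Accuracy of a guesser that always picks the longest choice.

    Each item is a dict with (assumed, hypothetical) keys:
      "choices": list of answer strings
      "answer":  index of the correct choice
    """
    if not items:
        raise ValueError("items must be non-empty")
    hits = 0
    for item in items:
        choices = item["choices"]
        # Index of the longest choice; ties go to the earliest index.
        guess = max(range(len(choices)), key=lambda i: len(choices[i]))
        hits += guess == item["answer"]
    return hits / len(items)


def chance_accuracy(items: Sequence[dict]) -> float:
    """Expected accuracy of uniform random guessing over the same items."""
    if not items:
        raise ValueError("items must be non-empty")
    return sum(1 / len(item["choices"]) for item in items) / len(items)


# Toy data, invented for illustration only.
toy_items = [
    {"choices": ["Sydney", "Perth", "Canberra, in the Australian Capital Territory", "Hobart"],
     "answer": 2},
    {"choices": ["4", "5", "6", "7"], "answer": 1},
]

if __name__ == "__main__":
    print("longest-choice accuracy:", longest_choice_accuracy(toy_items))
    print("chance accuracy:", chance_accuracy(toy_items))
```

Comparing the two numbers on a full benchmark gives a rough signal only; a gap on a handful of items, as in the toy data, is not evidence of anything.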
