CL | Feb 25, 2025

WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging

arXiv:2502.18316v1 | 9 citations | h-index: 33 | Has Code | ACL
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for more robust evaluation of LLMs, though the contribution is incremental because it modifies existing benchmarks rather than introducing new ones.

The researchers tackled the problem of making multiple-choice benchmarks more challenging by introducing WiCkeD, a method that randomly replaces a choice with 'None of the above', resulting in an average performance drop of 12.1 points across 18 open-weight LLMs on 6 benchmarks.

We introduce WiCkeD, a simple method to increase the complexity of existing multiple-choice benchmarks by randomly replacing a choice with "None of the above", a method often used in educational tests. We show that WiCkeD can be automatically applied to any existing benchmark, making it more challenging. We apply WiCkeD to 6 popular benchmarks and use it to evaluate 18 open-weight LLMs. The performance of the models drops 12.1 points on average with respect to the original versions of the datasets. When using chain-of-thought on 3 MMLU datasets, the performance drop for the WiCkeD variant is similar to the one observed when using the LLMs directly, showing that WiCkeD is also challenging for models with enhanced reasoning abilities. WiCkeD also uncovers that some models are more sensitive to the extra reasoning required, providing additional information with respect to the original benchmarks. We release our code and data at https://github.com/ahmedselhady/wicked-benchmarks.
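The core transformation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the linked repository for that). The function name `wicked_variant`, the choice to place "None of the above" as the last option, and the rule that "None of the above" becomes the gold answer when the correct choice is the one removed are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Tuple

NONE_OF_THE_ABOVE = "None of the above"


def wicked_variant(
    choices: List[str],
    answer_idx: int,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], int]:
    """Hypothetical helper: replace one randomly chosen option with
    "None of the above" and return the new options and gold index.

    Assumption: the replaced option is dropped and "None of the above"
    is appended last, so the number of options stays the same.
    """
    if not 0 <= answer_idx < len(choices):
        raise ValueError("answer_idx must index into choices")
    if rng is None:
        rng = random.Random()

    # Pick the option to remove uniformly at random.
    removed = rng.randrange(len(choices))
    new_choices = [c for i, c in enumerate(choices) if i != removed]
    new_choices.append(NONE_OF_THE_ABOVE)

    if removed == answer_idx:
        # Assumption: with the correct option gone, "None of the above"
        # becomes the gold answer.
        new_answer = len(new_choices) - 1
    elif removed < answer_idx:
        # Options after the removed one shift left by one position.
        new_answer = answer_idx - 1
    else:
        new_answer = answer_idx
    return new_choices, new_answer


if __name__ == "__main__":
    options = ["Paris", "Rome", "Berlin", "Madrid"]
    print(wicked_variant(options, answer_idx=0, rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps the perturbed benchmark reproducible across evaluation runs, which matters when comparing models on the same variant.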

Code Implementations: 1 repo