MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs
This work addresses the need for more challenging evaluation frameworks that better discriminate between advanced LLMs. Its contribution is incremental, since it builds on the existing MMLU-Pro benchmark.
The authors address the failure of existing benchmarks to differentiate top-performing large language models by introducing MMLU-Pro+, an enhanced benchmark that assesses higher-order reasoning and shortcut learning through questions with multiple correct answers. Their results show that MMLU-Pro+ maintains MMLU-Pro's difficulty while providing a more rigorous test, revealing significant performance gaps and differences in bias susceptibility across six state-of-the-art LLMs.
Existing benchmarks for large language models (LLMs) increasingly struggle to differentiate between top-performing models, underscoring the need for more challenging evaluation frameworks. We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut learning and higher-order reasoning in LLMs. By incorporating questions with multiple correct answers across diverse domains, MMLU-Pro+ tests LLMs' ability to engage in complex reasoning and resist simplistic problem-solving strategies. Our results show that MMLU-Pro+ maintains MMLU-Pro's difficulty while discriminating more rigorously between models, particularly in multi-correct answer scenarios. We introduce novel metrics, such as the shortcut selection ratio and the correct pair identification ratio, that offer deeper insights into model behavior and anchoring bias. Evaluations of six state-of-the-art LLMs reveal significant performance gaps, highlighting variations in reasoning abilities and bias susceptibility. We release the dataset and evaluation code at https://github.com/asgsaeid/mmlu-pro-plus.
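The abstract names the two metrics but does not define them, so the following is only an illustrative sketch of how such ratios might be computed, not the authors' definitions or their released code. It assumes each modified question adds a combined option (for example, "Both X and Y are correct") that is the right answer, and that the originally correct single option is still present. The `Response` record, its field names, and both function bodies are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Response:
    """One model answer to a modified multi-correct question (hypothetical schema)."""
    chosen: str           # option label the model picked, e.g. "A"
    original_answer: str  # label of the single option that was correct before modification
    both_option: str      # label of the added combined option, assumed to be the right answer


def shortcut_selection_ratio(responses: Sequence[Response]) -> float:
    """Assumed reading: share of questions where the model fell back on the
    original single answer instead of the more complete combined option."""
    if not responses:
        return 0.0
    hits = sum(r.chosen == r.original_answer for r in responses)
    return hits / len(responses)


def correct_pair_identification_ratio(responses: Sequence[Response]) -> float:
    """Assumed reading: share of questions where the model selected the
    combined option that names both correct answers."""
    if not responses:
        return 0.0
    hits = sum(r.chosen == r.both_option for r in responses)
    return hits / len(responses)


# Four made-up responses, used only to exercise the functions.
sample = [
    Response(chosen="A", original_answer="A", both_option="K"),  # shortcut taken
    Response(chosen="K", original_answer="A", both_option="K"),  # pair identified
    Response(chosen="K", original_answer="B", both_option="K"),  # pair identified
    Response(chosen="C", original_answer="B", both_option="K"),  # neither
]
shortcut = shortcut_selection_ratio(sample)
pair = correct_pair_identification_ratio(sample)
```

On this made-up sample the shortcut ratio is 0.25 and the pair identification ratio is 0.5. Under this reading, a high shortcut ratio would point to anchoring on the previously correct answer, while a high pair identification ratio would point to the model recognizing that more than one option holds.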