CL | Jul 1, 2025

Pitfalls of Evaluating Language Models with Open Benchmarks

Md. Najib Hasan, Mohammad Fakhruddin Babar, Souvika Sarkar, Monowar Hasan, Santu Karmaker

arXiv:2507.00460v1 | 13.07 citations | h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses flaws in how researchers and practitioners evaluate language models, showing that open benchmarks can be gamed by fine-tuning on their public test sets. The insight is incremental but important for improving assessment practices.

The study exposed pitfalls in open LLM benchmarks by showing that small models fine-tuned on public test sets can achieve top rankings on HELM without generalizing well, highlighting a disconnect between leaderboard performance and real-world utility.

Open Large Language Model (LLM) benchmarks, such as HELM and BIG-bench, offer standardized, transparent protocols that facilitate the fair comparison, reproducibility, and iterative advancement of Language Models (LMs). However, their openness also introduces critical and underexplored pitfalls. This study exposes these weaknesses by systematically constructing "cheating" models (smaller variants of BART, T5, and GPT-2 fine-tuned directly on public test sets) which achieve top rankings on a prominent open, holistic benchmark (HELM) despite poor generalization and limited practical utility. Our findings underscore three key insights: (a) high leaderboard performance on open benchmarks may not always reflect real-world effectiveness; (b) private or dynamic benchmarks must complement open evaluations to safeguard integrity; and (c) a fundamental reevaluation of current benchmarking practices is essential to ensure robust and trustworthy LM assessments.
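
The gap between leaderboard score and generalization described in the abstract can be illustrated with a toy sketch. This is not the authors' setup: the lookup-table "model" below is a hypothetical stand-in for a small LM fine-tuned on a public test set, and the two tiny datasets, the function names, and the fallback rule are all assumptions made for illustration.

```python
from collections import Counter


def train_memorizer(examples):
    """Return a predictor that memorizes (prompt, answer) pairs.

    Hypothetical stand-in for fine-tuning on a public test set: seen
    prompts are answered from the table, unseen prompts get the most
    common memorized answer.
    """
    table = dict(examples)
    fallback = Counter(answer for _, answer in examples).most_common(1)[0][0]

    def predict(prompt):
        return table.get(prompt, fallback)

    return predict


def accuracy(predict, examples):
    """Fraction of examples where the prediction matches the answer exactly."""
    return sum(predict(p) == a for p, a in examples) / len(examples)


# Toy data (assumed for illustration, not taken from HELM).
public_test = [("2+2", "4"), ("3+5", "8"), ("7+1", "8"), ("6+3", "9")]
private_test = [("1+1", "2"), ("4+4", "8"), ("5+2", "7"), ("9+0", "9")]

cheater = train_memorizer(public_test)
public_acc = accuracy(cheater, public_test)    # scored on the data it memorized
private_acc = accuracy(cheater, private_test)  # scored on unseen prompts

print(f"public: {public_acc:.2f}, private: {private_acc:.2f}")
```

On this toy data the memorizer is perfect on the public set and correct on only one of four private prompts, which is the kind of disconnect a private or dynamic benchmark is meant to surface.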
