CL · May 21, 2025

Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

arXiv:2505.15055v2 · 24 citations · h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses unreliable benchmarking, a concern for LLM developers and researchers, and offers a method to improve evaluation accuracy. It is rated incremental because it builds on existing IRT frameworks.

The paper tackles inconsistent and poorly separable benchmarks for evaluating large language models (LLMs) by proposing PSN-IRT, an enhanced Item Response Theory framework. Applying it to 11 benchmarks comprising 41,871 items reveals significant shortcomings in their measurement quality, and the authors show that PSN-IRT can be used to construct smaller benchmarks that align better with human preferences.

The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about their ability to accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining prominent mainstream LLM benchmarks using results from diverse models. We first propose Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. PSN-IRT can be utilized for accurate and reliable estimations of item characteristics and model abilities. Based on PSN-IRT, we conduct extensive analysis on 11 LLM benchmarks comprising 41,871 items, revealing significant and varied shortcomings in their measurement quality. Furthermore, we demonstrate that PSN-IRT can be leveraged to construct smaller benchmarks while maintaining stronger alignment with human preference.
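For readers unfamiliar with Item Response Theory, the sketch below shows the textbook four-parameter logistic (4PL) item response function and its Fisher information. It is an illustration only: the abstract does not say which item parameters PSN-IRT uses or how it estimates them, so the parameter names (`a`, `b`, `c`, `d`) and both function names are assumptions, not the authors' implementation.

```python
import math


def irt_4pl(theta, a, b, c=0.0, d=1.0):
    """Probability that a model with ability `theta` answers an item correctly.

    Textbook 4PL parameters (assumed here, not taken from the paper):
      a: discrimination (slope), b: difficulty (location),
      c: lower asymptote (guessing), d: upper asymptote.
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b, c=0.0, d=1.0):
    """Fisher information the item carries about ability at `theta`.

    Higher values mean the item separates models near `theta` more sharply.
    """
    p = irt_4pl(theta, a, b, c, d)
    numerator = (a ** 2) * ((p - c) ** 2) * ((d - p) ** 2)
    denominator = ((d - c) ** 2) * p * (1.0 - p)
    return numerator / denominator


if __name__ == "__main__":
    # A hypothetical multiple-choice item with a 25% guessing floor.
    print(irt_4pl(theta=0.0, a=1.0, b=0.0, c=0.25))
    print(fisher_information(theta=0.0, a=2.0, b=0.0))
```

Under this classical model, items whose information is low across the ability range of current top models contribute little to separating them, which is one way to read the separability concern raised in the abstract.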

Code Implementations: 1 repo
