CL · Apr 25, 2024

Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks

arXiv:2404.16966v2 · 44 citations · h-index: 7 · ACL
Originality: Incremental advance
AI Analysis

This paper addresses the robustness of LLM evaluation, a critical but often overlooked issue in benchmarking practice that affects both researchers and practitioners.

Benchmarks for evaluating Large Language Models (LLMs) assume that test prompts are a random sample from a real-world distribution, an assumption that often does not hold. The paper finds that accounting for correlations across test prompts can change model rankings on major benchmarks.

Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance. This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies according to the specific use case. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.
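
To make finding (2) concrete, here is a minimal sketch of how a ranking can flip once correlated prompts are no longer counted independently. It is not the paper's method: the scores, the model names, the `clusters` grouping, and the "average within a cluster, then across clusters" weighting are all hypothetical choices made for illustration.

```python
from math import sqrt

# Hypothetical toy data: 1 = the model answered the prompt correctly.
scores = {
    "model_a": [1, 1, 1, 1, 0, 0],
    "model_b": [0, 0, 0, 0, 1, 1],
    "model_c": [1, 1, 0, 0, 0, 1],
}

# Hypothetical grouping: prompts 0-3 are near-duplicates of each other,
# prompts 4 and 5 each stand alone.
clusters = [0, 0, 0, 0, 1, 2]


def prompt_column(j):
    """Outcomes of every model on prompt j."""
    return [row[j] for row in scores.values()]


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def uniform_mean(row):
    """Standard benchmark score: every prompt counts equally."""
    return sum(row) / len(row)


def cluster_mean(row, cluster_ids):
    """Average within each cluster first, then across clusters,
    so a group of near-duplicate prompts counts only once."""
    by_cluster = {}
    for score, cid in zip(row, cluster_ids):
        by_cluster.setdefault(cid, []).append(score)
    per_cluster = [sum(v) / len(v) for v in by_cluster.values()]
    return sum(per_cluster) / len(per_cluster)


def rank(metric):
    """Model names sorted from best to worst under the given metric."""
    return sorted(scores, key=metric, reverse=True)


uniform_rank = rank(lambda m: uniform_mean(scores[m]))
cluster_rank = rank(lambda m: cluster_mean(scores[m], clusters))

print("prompt 0 vs prompt 1 correlation:", pearson(prompt_column(0), prompt_column(1)))
print("uniform ranking:", uniform_rank)
print("cluster-weighted ranking:", cluster_rank)
```

On this toy data the uniform average ranks `model_a` first and `model_b` last, while the cluster-weighted average reverses that order, because `model_a` owes its lead entirely to four near-duplicate prompts. Which weighting is appropriate depends on the distribution of interest for the use case, which is the abstract's central point.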
