LGAIJul 7, 2025

Train-before-Test Harmonizes Language Model Rankings

arXiv:2507.05195v27 citationsh-index: 5
Originality Incremental advance
AI Analysis

This addresses model selection confusion for LLM developers and researchers, though it is an incremental methodological improvement rather than a paradigm shift.

The paper tackles the problem of contradictory language model rankings across benchmarks by proposing train-before-test, where models are fine-tuned identically before evaluation. The result shows this approach yields consistent rankings across 24 benchmarks and 61 models, restores the connection between perplexity and performance, and reduces the model-score matrix to rank one.

Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare model potential by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach train-before-test. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas traditional rankings demonstrate little external validity under direct evaluation, they enjoy a significant degree of external validity when applying train-before-test: model potential rankings transfer gracefully from one benchmark to another. Second, train-before-test restores the connection between perplexity and downstream task performance, lost under direct evaluation. Remarkably, even pre-finetuning perplexity of a base model predicts post-finetuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by one latent factor, uncovered by train-before-test. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes