CL · AI · LG · Jan 24, 2025

Tuning LLM Judge Design Decisions for 1/1000 of the Cost

arXiv:2501.17178v4 · 7 citations · h-index: 4 · Has Code · ICML
Originality: Incremental advance
AI Analysis

This work addresses the problem of costly LLM evaluation for researchers and practitioners, offering a more accessible and reproducible method, though it is incremental as it builds on existing LLM judge approaches.

The paper tackles the high cost of evaluating LLMs by systematically tuning the hyperparameters of LLM-based judges, using multi-objective multi-fidelity optimization to reduce the cost of the search. The resulting judges outperform existing benchmarks in accuracy and cost-efficiency while relying on open-weight models.

Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present between different papers. For instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective multi-fidelity optimization, which makes it possible to find judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also use open-weight models, ensuring greater accessibility and reproducibility. The code to reproduce our experiments is available at https://github.com/geoalgo/judgetuning .
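The two ideas named in the abstract can be illustrated with a minimal sketch. It is not taken from the paper or its repository: successive halving stands in here as one common multi-fidelity scheme, and the paper's actual search algorithm may differ. The judge configurations, their costs and accuracies, and the names `pareto_front`, `successive_halving`, and `toy_evaluate` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Score = Tuple[float, float]  # (cost per annotation, accuracy); hypothetical units


def pareto_front(points: List[Score]) -> List[int]:
    """Return indices of points that no other point dominates.

    A point dominates another if it is no more expensive and no less
    accurate, and strictly better on at least one of the two objectives.
    """
    front = []
    for i, (cost_i, acc_i) in enumerate(points):
        dominated = any(
            cost_j <= cost_i and acc_j >= acc_i and (cost_j < cost_i or acc_j > acc_i)
            for j, (cost_j, acc_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


def successive_halving(
    n_configs: int,
    evaluate: Callable[[int, int], Score],
    budgets: List[int],
    keep: float = 0.5,
) -> Tuple[Dict[int, Score], List[int]]:
    """Evaluate every config on a small budget, then re-evaluate only the
    most accurate fraction on each larger budget.

    Returns the latest score of every config and the final survivors.
    Promotion here uses accuracy only, which is a simplification.
    """
    survivors = list(range(n_configs))
    latest: Dict[int, Score] = {}
    for budget in budgets:
        scored = [(evaluate(i, budget), i) for i in survivors]
        for score, i in scored:
            latest[i] = score
        scored.sort(key=lambda item: item[0][1], reverse=True)
        n_keep = max(1, int(len(scored) * keep))
        survivors = [i for _, i in scored[:n_keep]]
    return latest, survivors


# Hypothetical judge configurations: (cost per annotation, accuracy).
TOY_SCORES: List[Score] = [(1.0, 0.60), (2.0, 0.70), (5.0, 0.80), (6.0, 0.65)]


def toy_evaluate(config_index: int, budget: int) -> Score:
    # A real evaluation would run the judge on `budget` annotated pairs;
    # this stand-in ignores the budget and returns fixed numbers.
    return TOY_SCORES[config_index]


latest, survivors = successive_halving(len(TOY_SCORES), toy_evaluate, budgets=[10, 100])
front = pareto_front([latest[i] for i in range(len(TOY_SCORES))])
```

In this toy setup, the low-budget rounds discard weak configurations cheaply, and the Pareto front then exposes the remaining accuracy-versus-cost trade-offs rather than a single winner.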

