Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

arXiv:2603.22214 | 52.5 | h-index: 7 | Has Code

AI Analysis

This work addresses the need for scalable and consistent evaluation of LLM outputs, though its contribution is incremental: it refines an emerging technique rather than introducing a new one.

The study evaluated the reliability of using large language models (LLMs) as automated judges to assess the quality of other LLMs, finding high correlation with human assessments, particularly for models like GPT-4o and open-source models with at least 32B parameters.

A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. Automating the analysis in this way scales up the complex evaluation of the victim models' free-form text outputs, because the judgments are faster and more consistent than those of human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. Because the technique is comparatively new, LLMs as judges have not yet been thoroughly investigated for their reliability and their agreement with human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors. As the assessment objective, we curate datasets for eight different categories of judgment tasks, together with the corresponding ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments when combined with a suitable prompt, in particular for GPT-4o, several open-source models with $\geq 32$B parameters, and a few smaller models like Qwen2.5 14B.
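
To make the abstract's definition concrete, the following is a minimal sketch of an LLM as judge as the pairing of one model with one judge prompt, followed by a comparison of its verdicts against human labels. Everything here is an illustrative assumption rather than the authors' setup: the `Judge` class, the `stub_model` stand-in (a toy function in place of a real LLM call), the PASS/FAIL verdict format, and the example data are all hypothetical. The abstract reports "correlation" with human assessments without naming a metric, so raw agreement and Cohen's kappa are used here only as common choices for binary labels.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Judge:
    """An LLM as judge: one model combined with one judge prompt."""
    model: Callable[[str], str]  # hypothetical interface: prompt text in, reply text out
    prompt_template: str         # holds the criteria and an {output} placeholder

    def verdict(self, victim_output: str) -> bool:
        # Fill the victim model's output into the judge prompt and parse the reply.
        reply = self.model(self.prompt_template.format(output=victim_output))
        return reply.strip().upper().startswith("PASS")


def agreement_rate(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of items on which judge and human labels match."""
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Chance-corrected agreement for two binary label sequences."""
    n = len(human)
    p_observed = agreement_rate(judge, human)
    p_judge, p_human = sum(judge) / n, sum(human) / n
    # Agreement expected if both raters labeled independently at their own base rates.
    p_expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


def stub_model(prompt: str) -> str:
    # Toy stand-in for a real LLM call (assumption): passes any text mentioning "Paris".
    return "PASS" if "Paris" in prompt else "FAIL"


judge = Judge(
    model=stub_model,
    prompt_template=(
        "Criterion: the answer correctly names the capital of France.\n"
        "Answer: {output}\n"
        "Reply with PASS or FAIL."
    ),
)

# Hypothetical victim-model outputs with human ground-truth labels.
outputs = [
    "The capital is Paris.",
    "It is Lyon.",
    "Paris, of course.",
    "Paris is in Italy; the capital is Marseille.",
]
human_labels = [True, False, True, False]
judge_labels = [judge.verdict(o) for o in outputs]

print(agreement_rate(judge_labels, human_labels))
print(cohens_kappa(judge_labels, human_labels))
```

In this toy run the judge wrongly passes the last output, which shows why both the model and the prompt matter: swapping either component yields a different judge whose agreement with human labels has to be measured again.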
