Bridging Human and LLM Judgments: Understanding and Narrowing the Gap
This work addresses unreliable LLM-as-a-judge evaluations, offering researchers and practitioners a principled method to narrow human-LLM gaps. The contribution is incremental, since it builds on existing evaluation paradigms.
The paper tackles the systematic divergence between human and LLM judgments of model outputs. It introduces Bridge, a statistical framework that models LLM deviations from a latent human preference score as linear transformations of covariates and uses the fitted model to refine LLM ratings, achieving higher agreement with human ratings on BigGen Bench and Chatbot Arena.
Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations under both absolute scoring and pairwise comparison paradigms. Bridge posits a latent human preference score for each prompt-response pair and models LLM deviations as linear transformations of covariates that capture sources of discrepancies. This offers a simple and principled framework for refining LLM ratings and characterizing systematic discrepancies between humans and LLMs. We provide an efficient fitting algorithm with asymptotic guarantees for statistical inference. Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings (accuracy, calibration, and KL divergence) and exposes systematic human-LLM gaps.
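To make the core idea concrete, the following is a minimal sketch of the "latent human score plus linear deviation" structure described above. It is not the paper's fitting algorithm: it assumes continuous scores, an ordinary least-squares fit, synthetic data, and a single hypothetical covariate (a standardized response length). The function names `fit_linear_gap` and `refine` are illustrative only.

```python
import numpy as np


def fit_linear_gap(llm_scores, human_scores, covariates):
    """Least-squares estimate of beta in: llm - human ~= covariates @ beta.

    Simplifying assumption: scores are continuous and the gap is linear in
    the covariates. The paper's own estimator may differ.
    """
    gap = llm_scores - human_scores
    beta, *_ = np.linalg.lstsq(covariates, gap, rcond=None)
    return beta


def refine(llm_scores, covariates, beta):
    """Subtract the estimated systematic deviation from the LLM scores."""
    return llm_scores - covariates @ beta


# Synthetic data (assumption): a latent human score per prompt-response pair.
rng = np.random.default_rng(0)
n = 500
human = rng.normal(size=n)

# Covariates: an intercept plus one hypothetical feature (standardized length).
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([0.5, 0.8])  # made-up bias: constant offset + length effect

# LLM judge score = human score + linear deviation + small noise.
llm = human + X @ true_beta + 0.1 * rng.normal(size=n)

# Fit on a small human-labeled subset, then refine all LLM scores.
labeled = slice(0, 100)
held_out = slice(100, n)
beta_hat = fit_linear_gap(llm[labeled], human[labeled], X[labeled])
refined = refine(llm, X, beta_hat)

# Compare agreement with human scores on the held-out pairs.
raw_mse = np.mean((llm[held_out] - human[held_out]) ** 2)
refined_mse = np.mean((refined[held_out] - human[held_out]) ** 2)
print(f"estimated beta: {beta_hat}")
print(f"held-out MSE raw: {raw_mse:.3f}, refined: {refined_mse:.3f}")
```

In this toy setting the estimated coefficients play the role of the "systematic human-LLM gaps": a nonzero coefficient on the length feature would indicate that the judge rewards or penalizes length relative to humans. Handling discrete ratings or pairwise comparisons would require a different likelihood than the squared-error fit used here.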