Reference-Free Rating of LLM Responses via Latent Information
This work addresses unreliable reference-free LLM evaluation, a problem for both researchers and practitioners, by offering a more deterministic and discriminative rating method. The contribution is incremental, since it builds on existing latent-signal approaches.
The paper examines the unreliability of single-response, reference-free LLM-as-a-judge ratings and identifies two issues: instability under sampling and poor calibration. It proposes Latent Judges, which derive ratings from internal model signals and match or surpass standard prompting across benchmarks. Probability-weighted scores achieve the strongest single-rating correlations, and probes recover useful signals when output logits are miscalibrated.
How reliable are single-response LLM-as-a-judge ratings without references, and can we obtain fine-grained, deterministic scores in this setting? We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses and show two systematic issues: scores are unstable under sampling and poorly calibrated, leading to compression near the top of the scale and frequent ties. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals: (i) probability-weighted scores over integer ratings, (ii) verifier-style probabilities of "yes", and (iii) linear probes trained on model activations at the rating position. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting, with consistent gains on pairwise accuracy and listwise ranking relevant to Best-of-N selection. Probability-weighted scores achieve the strongest single-rating correlations, while probes recover useful signals when output logits are miscalibrated. These results indicate that latent information provides deterministic and more discriminative signals for reference-free evaluation, and can improve selection and training approaches such as Best-of-N, multi-teacher distillation, and routing.
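To make the first two latent signals concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a 1-to-5 Likert scale, assumes the judge's logits for the rating tokens (or for "yes" and "no") have already been extracted at the rating position, and renormalizes over just those tokens. The function names and the example logits are hypothetical. It illustrates how two responses that tie under a greedy integer rating can still be separated by a probability-weighted score.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probability_weighted_score(rating_logits, ratings=(1, 2, 3, 4, 5)):
    """Expected rating under the distribution restricted to the rating tokens.

    Assumption: probabilities are renormalized over the rating tokens only.
    """
    probs = softmax(list(rating_logits))
    return sum(r * p for r, p in zip(ratings, probs))


def greedy_rating(rating_logits, ratings=(1, 2, 3, 4, 5)):
    """Integer rating that greedy decoding would emit (the most likely token)."""
    best = max(range(len(ratings)), key=lambda i: rating_logits[i])
    return ratings[best]


def yes_probability(yes_logit, no_logit):
    """Verifier-style score: P("yes") renormalized over {"yes", "no"}."""
    return 1.0 / (1.0 + math.exp(no_logit - yes_logit))


# Hypothetical logits for the tokens "1".."5" on two different responses.
logits_a = [-4.0, -3.0, -1.0, 0.5, 2.0]
logits_b = [-4.0, -3.0, -2.0, -1.0, 2.5]

# Both responses receive the same greedy integer rating (a tie) ...
tie = greedy_rating(logits_a) == greedy_rating(logits_b)

# ... but the expected ratings differ, which gives a finer-grained ordering.
score_a = probability_weighted_score(logits_a)
score_b = probability_weighted_score(logits_b)
```

Because both scores are deterministic functions of the logits, repeated evaluation of the same response yields the same value, unlike a sampled rating. The third variant, a linear probe on activations, would need access to hidden states and labeled training data, so it is not sketched here.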