LG · Sep 26, 2025

Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas

arXiv:2509.22957v1 · 4 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses external validity concerns for AI researchers and practitioners by providing a method to improve generalization from lab-based to real-world evaluations, though it is incremental as it builds on existing doubly-robust and LLM-as-a-judge approaches.

The paper tackles the problem of evaluation sampling bias in Generative AI systems by proposing a doubly-robust estimation framework that combines imperfect LLM-generated persona ratings with biased human ratings to produce statistically valid system quality estimates, validated theoretically and through a Persona Simulation Framework.

As Generative AI (GenAI) systems see growing adoption, a key concern involves the external validity of evaluations, or the extent to which they generalize from lab-based to real-world deployment conditions. Threats to the external validity of GenAI evaluations arise when the source sample of human raters and system outputs used to obtain a system quality estimate differs from the target distribution at deployment time. In this work, we propose a doubly-robust estimation framework designed to address this evaluation sampling bias. Key to our approach is the use of "persona" ratings produced by prompting an LLM evaluator (i.e., an LLM-as-a-judge) to behave as a human rater with specific sociodemographic characteristics. Our doubly-robust framework combines these informative yet imperfect persona ratings with human ratings obtained under evaluation sampling bias to produce statistically valid system quality estimates. In particular, we show that our approach yields valid system quality estimates when either (i) a model trained to predict human ratings using persona ratings and source data observed under sampling bias, or (ii) a reweighting model that corrects for sampling bias is of sufficient quality. We validate our framework theoretically and via a novel Persona Simulation Framework (PSF) designed to systematically manipulate persona quality and the degree of evaluation sampling bias present in source data. Our work provides a principled foundation for combining imperfect persona ratings with human ratings observed under sampling bias to obtain valid system quality estimates.
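The doubly-robust idea in the abstract can be illustrated with a small simulation. What follows is a minimal sketch of a generic augmented, reweighted estimator, not the paper's own estimator or its Persona Simulation Framework. Every name and number in it is an assumption made for illustration: the single binary rater attribute `x`, the rating distributions, the linear outcome model fit on persona ratings, and the functions `doubly_robust_estimate` and `simulate`. The estimate averages the outcome model's predictions over the target sample, then adds a weighted average of source residuals to correct the bias the outcome model leaves behind.

```python
import numpy as np


def doubly_robust_estimate(y_src, m_src, w_src, m_tgt):
    """Mean outcome-model prediction on the target sample plus a
    self-normalized, reweighted average of source residuals."""
    y_src, m_src, w_src, m_tgt = map(np.asarray, (y_src, m_src, w_src, m_tgt))
    correction = np.sum(w_src * (y_src - m_src)) / np.sum(w_src)
    return float(np.mean(m_tgt) + correction)


def simulate(seed=0, n_src=20000, n_tgt=20000):
    """Hypothetical setup: one binary rater attribute x that is rare in the
    source sample (20%) and common in the target population (70%)."""
    rng = np.random.default_rng(seed)
    x_src = rng.binomial(1, 0.2, n_src)
    x_tgt = rng.binomial(1, 0.7, n_tgt)

    # Human ratings (observed only for the source sample in this sketch).
    y_src = 1.0 + 2.0 * x_src + rng.normal(0.0, 0.5, n_src)
    # Persona ratings: informative but noisy and miscalibrated (assumed form).
    p_src = 0.5 + 1.5 * x_src + rng.normal(0.0, 1.0, n_src)
    p_tgt = 0.5 + 1.5 * x_tgt + rng.normal(0.0, 1.0, n_tgt)

    # Outcome model: least-squares fit of human rating on persona rating.
    design_src = np.column_stack([np.ones(n_src), p_src])
    coef, *_ = np.linalg.lstsq(design_src, y_src, rcond=None)
    m_src = design_src @ coef
    m_tgt = np.column_stack([np.ones(n_tgt), p_tgt]) @ coef

    # Reweighting model: ratio of target to source frequencies of x.
    f_src, f_tgt = x_src.mean(), x_tgt.mean()
    w_src = np.where(x_src == 1, f_tgt / f_src, (1 - f_tgt) / (1 - f_src))

    return {
        "true_target_mean": 1.0 + 2.0 * 0.7,
        "naive_source_mean": float(y_src.mean()),
        "outcome_model_only": float(m_tgt.mean()),
        "doubly_robust": doubly_robust_estimate(y_src, m_src, w_src, m_tgt),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

In this toy setup the outcome model is deliberately misspecified, because it sees only the noisy persona rating, so the outcome-model-only estimate stays biased. The reweighting model is accurate, which is what lets the combined estimate recover the target mean. This mirrors condition (ii) in the abstract; condition (i) would be the reverse case, with a good outcome model and poor weights.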