Do LLM Evaluators Prefer Themselves for a Reason?
This addresses a critical reliability issue in LLM-based evaluation for AI benchmarking and applications, providing nuanced insights but is incremental in refining existing concerns.
The study investigated whether LLMs' self-preference bias in evaluation is harmful or legitimate by using verifiable benchmarks with objective ground truth, finding that stronger models' self-preference often aligns with superior performance but harmful bias persists when they err, and scaling strategies can reduce it.
Large language models (LLMs) are increasingly used as automatic evaluators in applications like benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses, a tendency often intensifying with model size and capability. This raises a critical question: Is self-preference harmful, or does it simply reflect the genuinely higher-quality outputs of stronger models? Answering this has been difficult as previous studies relied primarily on subjective tasks. These tasks lack an objective ground truth, meaning that either preference can be reasonably justified. To address this ambiguity, we investigate self-preference using verifiable benchmarks (mathematical reasoning, factual knowledge, code generation) that allow objective ground-truth assessment. This enables us to distinguish harmful self-preference (favoring objectively worse responses) from legitimate self-preference (favoring genuinely superior ones). We conduct large-scale experiments under controlled evaluation conditions across diverse model families (e.g., Llama, Qwen, Gemma, Mistral, Phi, GPT, DeepSeek). Our findings reveal three key insights: (1) While stronger models exhibit greater self-preference, much of this preference aligns with objectively superior performance, indicating stronger models prefer themselves mostly legitimately. (2) Harmful self-preference persists when evaluator models err as generators, and stronger models display more pronounced harmful self-preference when they do err. This suggests stronger models struggle more to recognize when they are wrong. (3) Inference-time scaling strategies, such as generating a long Chain-of-Thought before evaluation, effectively reduce harmful self-preference. These results provide a more nuanced understanding of LLM-based evaluation and practical insights for improving its reliability.