CL, Sep 3, 2025

Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges

arXiv:2509.03419v27 citations
h-index: 17
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of reliably evaluating LLMs on complex tasks. The advance is incremental, since it builds on prior work that focused on simple settings.

The paper examines LLMs as judges in complex tasks. It constructs ComplexEval to expose biases and finds that all evaluated models are significantly susceptible, with bias magnitude scaling with task complexity. Large Reasoning Models (LRMs) show a paradoxical vulnerability.

As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Their reliability in complex tasks--where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical--remains understudied. In this paper, we constructed ComplexEval, a challenge benchmark designed to systematically expose and quantify Auxiliary Information Induced Biases. We systematically investigated and validated 6 previously unexplored biases across 12 basic and 3 advanced scenarios. Key findings reveal: (1) all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity; (2) notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals, paving the way for more general and robust evaluation models.

