CL, AI, Sep 24, 2025

Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation

arXiv:2509.19880v1, 2 citations, h-index: 14, EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses inconsistent findings on how LLMs' generation and judgment abilities relate in LLM-as-Judge evaluation, offering researchers and practitioners a practical improvement for model selection tasks.

The paper tackles the weak correlation between LLMs' generation and judgment abilities by proposing a self-reference-guided evaluation strategy that uses a model's own answers as references. The strategy significantly strengthens this correlation and provides a reliable proxy for model selection.

LLM-as-Judge frameworks are increasingly popular for AI evaluation, yet research findings on the relationship between models' generation and judgment abilities remain inconsistent. We investigate this relationship through systematic dataset- and instance-level analyses across 11 models and 21 diverse tasks. Despite both capabilities relying on the same underlying knowledge, our analyses reveal they are only weakly correlated, primarily due to LLMs' sensitivity to the responses being judged. To address this, we propose a self-reference-guided evaluation strategy that leverages a model's own answers as references. This approach significantly strengthens the correlation between generation and judgment abilities, offering a practical path to align these skills and providing a reliable proxy for model selection in evaluation tasks.
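The self-reference idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `LLM` callable type, the function names, the prompt wording, and the yes/no verdict format are all assumptions made for illustration. The sketch only shows the two-step shape of the strategy, in which the judge model first answers the question itself and then judges a candidate with its own answer supplied as the reference.

```python
from typing import Callable, Optional

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def build_judge_prompt(question: str, candidate: str,
                       reference: Optional[str] = None) -> str:
    """Assemble a judging prompt. The wording is illustrative, not the paper's."""
    parts = [f"Question:\n{question}"]
    if reference is not None:
        parts.append(f"Reference answer (your own earlier answer):\n{reference}")
    parts.append(f"Candidate answer:\n{candidate}")
    parts.append("Is the candidate answer correct? Reply with 'yes' or 'no'.")
    return "\n\n".join(parts)


def self_reference_judge(model: LLM, question: str, candidate: str) -> bool:
    # Step 1 ("do"): the judge model answers the question itself.
    own_answer = model(f"Question:\n{question}\n\nAnswer concisely.")
    # Step 2 ("judge"): the same model judges the candidate, with its own
    # answer included in the prompt as the reference.
    verdict = model(build_judge_prompt(question, candidate, reference=own_answer))
    # Assumed verdict format: the reply starts with "yes" or "no".
    return verdict.strip().lower().startswith("yes")
```

Passing `reference=None` to `build_judge_prompt` gives the reference-free baseline prompt, so the two settings can be compared with the same model and the same candidates.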
