CL | AI | Jan 8

Agent-as-a-Judge

arXiv:2601.05111v1 | 9 citations | h-index: 7
Originality: Synthesis-oriented
AI Analysis

This work offers a roadmap for researchers and practitioners in AI evaluation to improve reliability in assessing specialized and multi-step tasks.

The paper examines the limitations of LLM-as-a-Judge in evaluating complex AI systems and presents the first comprehensive survey of Agent-as-a-Judge, establishing a framework and taxonomy for this emerging paradigm.

LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
