AI | Oct 14, 2024

Agent-as-a-Judge: Evaluate Agents with Agents

arXiv:2410.10934v2 | 157 citations | h-index: 16 | ICML
Originality: Incremental advance
AI Analysis

This paper addresses the need for better evaluation methods for agentic AI systems, though it appears to be an incremental advance that extends the LLM-as-a-Judge framework.

The authors tackle the inadequacy of current evaluation techniques for agentic systems by introducing the Agent-as-a-Judge framework, which uses agentic systems to evaluate other agentic systems and provides intermediate feedback throughout the task-solving process. They apply the framework to code generation, create a new benchmark called DevAI with 55 tasks and 365 hierarchical user requirements, and report that Agent-as-a-Judge dramatically outperforms LLM-as-a-Judge and is as reliable as their human evaluation baseline.

Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes -- ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply the Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, like a total of 365 hierarchical user requirements. We benchmark three of the popular agentic systems using Agent-as-a-Judge and find it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems -- by providing rich and reliable reward signals necessary for dynamic and scalable self-improvement.
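To make "hierarchical user requirements" and "as reliable as our human evaluation baseline" more concrete, here is a minimal sketch of one way such judgments could be compared. Everything in it is an illustrative assumption and is not taken from the paper or the DevAI data format. The names `Requirement`, `effective_verdicts`, and `agreement_rate` are hypothetical. So is the rule that a requirement only counts as met when its prerequisites are met, and so is the use of a simple agreement fraction as the reliability measure.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    """One user requirement (hypothetical schema, not the DevAI format)."""
    rid: str
    text: str
    depends_on: list[str] = field(default_factory=list)  # prerequisite ids


def effective_verdicts(reqs, raw):
    """Apply the assumed hierarchy rule: a requirement counts as met only
    if the judge marked it met AND all of its prerequisites count as met."""
    by_id = {r.rid: r for r in reqs}
    cache = {}

    def met(rid):
        if rid not in by_id:
            return False  # unknown prerequisite is treated as unmet
        if rid not in cache:
            cache[rid] = False  # provisional value guards against cycles
            cache[rid] = raw.get(rid, False) and all(
                met(dep) for dep in by_id[rid].depends_on
            )
        return cache[rid]

    return {r.rid: met(r.rid) for r in reqs}


def agreement_rate(a, b):
    """Fraction of shared requirements on which two judges agree."""
    shared = a.keys() & b.keys()
    if not shared:
        raise ValueError("no shared requirements to compare")
    return sum(a[k] == b[k] for k in shared) / len(shared)


# Toy task with a three-step dependency chain (invented for illustration).
reqs = [
    Requirement("R0", "Load the dataset"),
    Requirement("R1", "Train the model", depends_on=["R0"]),
    Requirement("R2", "Save evaluation metrics", depends_on=["R1"]),
]

agent_judge = effective_verdicts(reqs, {"R0": True, "R1": False, "R2": True})
human_judge = effective_verdicts(reqs, {"R0": True, "R1": True, "R2": False})

print(agent_judge)
print(agreement_rate(agent_judge, human_judge))
```

In this toy example the agent judge's raw "met" verdict on R2 is overridden because its prerequisite R1 failed, so the two judges disagree only on R1. A real comparison would use the benchmark's own annotations and whatever agreement metric the paper defines.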

Code Implementations: 1 repo
