AI | Oct 14, 2024

Agent-as-a-Judge: Evaluate Agents with Agents

Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra

arXiv:2410.10934v239.6161 citations | h-index: 16 | Has Code | ICML

Originality: Incremental advance

AI Analysis

This paper addresses the need for better evaluation methods for agentic systems in AI development, though it appears incremental because it extends the existing LLM-as-a-Judge framework.

The authors tackled the problem of inadequate evaluation techniques for agentic systems by introducing the Agent-as-a-Judge framework, which uses agents to evaluate other agents and provides intermediate feedback during task-solving. They applied this framework to code generation, creating a new benchmark called DevAI with 55 tasks and 365 requirements, and found that Agent-as-a-Judge dramatically outperformed LLM-as-a-Judge and matched human evaluation reliability.

Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes -- ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply the Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, like a total of 365 hierarchical user requirements. We benchmark three of the popular agentic systems using Agent-as-a-Judge and find it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems -- by providing rich and reliable reward signals necessary for dynamic and scalable self-improvement.
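The requirement-level judging described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Requirement` schema, the `check` callback standing in for the judging agent, and the rule that a requirement counts as satisfied only when its prerequisites are satisfied are all assumptions made here to give "hierarchical user requirements" a concrete shape. The `agreement_rate` helper is likewise a hypothetical way to compare a judge's verdicts with human labels.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    """One user requirement (hypothetical schema, not DevAI's actual format)."""
    rid: str
    text: str
    depends_on: list[str] = field(default_factory=list)  # prerequisite requirement ids


def judge_requirements(requirements, check):
    """Return {rid: bool} for every requirement.

    Assumption: a requirement is satisfied only if all of its prerequisites
    are satisfied and `check(requirement)` returns True. `requirements` must
    list prerequisites before the requirements that depend on them.
    """
    verdicts = {}
    for req in requirements:
        prereqs_ok = all(verdicts.get(dep, False) for dep in req.depends_on)
        verdicts[req.rid] = prereqs_ok and bool(check(req))
    return verdicts


def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of shared requirements on which judge and human labels agree."""
    shared = judge_verdicts.keys() & human_verdicts.keys()
    if not shared:
        raise ValueError("no requirements in common")
    return sum(judge_verdicts[r] == human_verdicts[r] for r in shared) / len(shared)


# Toy task with three chained requirements (illustrative only).
reqs = [
    Requirement("R0", "Load the dataset"),
    Requirement("R1", "Train a classifier", depends_on=["R0"]),
    Requirement("R2", "Save the accuracy plot", depends_on=["R1"]),
]

# Stand-in for the judging agent: pretends it found evidence for R0 and R2 only.
evidence = {"R0": True, "R1": False, "R2": True}
verdicts = judge_requirements(reqs, lambda r: evidence[r.rid])

# Hypothetical human labels that ignore the dependency chain.
human = {"R0": True, "R1": False, "R2": True}
rate = agreement_rate(verdicts, human)
```

In this toy run R2 is marked unsatisfied because its prerequisite R1 failed, so the judge disagrees with the human label on one of the three requirements. A real judging agent would replace the `evidence` lookup with inspection of the evaluated agent's workspace and intermediate outputs.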
