AgentBench: Evaluating LLMs as Agents
This work addresses the need for quantitative evaluation of LLMs as agents, which is crucial for researchers and developers working on AI agents, though it is incremental as it builds on existing benchmarking efforts.
The authors tackled the problem of evaluating large language models (LLMs) as agents by introducing AgentBench, a multi-dimensional benchmark with 8 environments, and found that top commercial LLMs perform strongly, but there is a significant performance gap with many open-source models, with failures often due to poor long-term reasoning and instruction following.
The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Improving instruction following and training on high quality multi-round alignment data could improve agent performance. And different from existing assumptions, training on code present ambivalent impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.