BenchAgents: Multi-Agent Systems for Structured Benchmark Creation
This work addresses the need for scalable, high-quality benchmarks to evaluate evolving generative AI models, though the contribution is incremental, since it builds on existing LLM and multi-agent methods.
The authors tackled the problem of slow and expensive manual benchmark creation by introducing BenchAgents, a multi-agent framework that automates the process using LLMs, resulting in benchmarks for planning, constraint satisfaction, and causal reasoning across language and vision modalities.
Evaluation insights are limited by the availability of high-quality benchmarks. As models evolve, there is a need to create benchmarks that can measure progress on new and complex generative capabilities. However, manually creating new benchmarks is slow and expensive, restricting comprehensive evaluations for any capability. We introduce BenchAgents, a multi-agent framework that methodically leverages large language models (LLMs) to automate evaluation benchmark creation while inherently ensuring data and (evaluation) metric quality. BenchAgents decomposes the benchmark creation process into planning, generation, verification, and evaluation, each of which is orchestrated via LLM agents. These agents interact with each other and utilize feedback from benchmark developers to improve and flexibly control data diversity and quality. We use BenchAgents to create benchmarks to evaluate capabilities related to planning, constraint satisfaction, and causal reasoning spanning both language and vision modalities. We then use these benchmarks to study state-of-the-art models and extract new insights into common failure modes and model differences.
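As a rough illustration of the four stages named above (planning, generation, verification, evaluation), the following minimal Python sketch chains one function per stage. It is not the authors' implementation: every name here (`LLM`, `Benchmark`, `planning_agent`, `generation_agent`, `verification_agent`, `evaluation_agent`, `build_benchmark`, `stub_llm`) and every prompt string is a hypothetical placeholder, and the LLM is replaced by a deterministic stub so the example runs without any model access.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for an LLM call: maps a prompt string to a response string.
LLM = Callable[[str], str]


@dataclass
class Benchmark:
    """Container for the artifacts produced by each stage (illustrative only)."""
    plan: str = ""
    instances: List[str] = field(default_factory=list)
    verified: List[str] = field(default_factory=list)
    metrics: List[str] = field(default_factory=list)


def planning_agent(llm: LLM, task: str, feedback: str = "") -> str:
    # Drafts a plan for the benchmark, optionally conditioned on developer feedback.
    return llm(f"PLAN task={task} feedback={feedback}")


def generation_agent(llm: LLM, plan: str, n: int) -> List[str]:
    # Produces n candidate benchmark instances from the plan.
    return [llm(f"GENERATE plan={plan} index={i}") for i in range(n)]


def verification_agent(llm: LLM, instances: List[str]) -> List[str]:
    # Keeps only the instances that the verifier accepts.
    return [x for x in instances if llm(f"VERIFY instance={x}") == "ok"]


def evaluation_agent(llm: LLM, plan: str) -> List[str]:
    # Proposes evaluation metrics for the planned benchmark.
    return [llm(f"METRIC plan={plan}")]


def build_benchmark(llm: LLM, task: str, n: int, feedback: str = "") -> Benchmark:
    # Runs the four stages in sequence and collects their outputs.
    plan = planning_agent(llm, task, feedback)
    instances = generation_agent(llm, plan, n)
    verified = verification_agent(llm, instances)
    metrics = evaluation_agent(llm, plan)
    return Benchmark(plan, instances, verified, metrics)


def stub_llm(prompt: str) -> str:
    # Deterministic fake model: rejects the instance with index 1, echoes otherwise.
    if prompt.startswith("VERIFY"):
        return "reject" if prompt.endswith("index=1") else "ok"
    return prompt.lower()


bench = build_benchmark(stub_llm, "calendar scheduling", n=3)
print(len(bench.instances), "generated;", len(bench.verified), "kept after verification")
```

In this sketch developer feedback enters only through the `feedback` argument of the planning stage; the abstract describes agents that interact with each other and with developers more broadly, which a fuller design would model as a loop rather than a single pass.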