AI · Mar 6

The World Won't Stay Still: Programmable Evolution for Agent Benchmarks

Guangrui Li, Yaochen Xie, Yi Liu, Ziwei Dong, Xingyuan Pan, Tianqi Zheng, Jason Choi, Michael J. Morais, Binit Jha, Shaunak Mishra, Bingrou Zhou, Chen Luo

arXiv:2603.05910v1 · 25.9 · h-index: 11

Predicted impact top 5% in AI · last 90 days

Originality Highly original

AI Analysis

This work tackles the problem of evaluating the adaptability of LLM-powered agents to real-world dynamic environments, which is crucial for developing more robust and generalizable agents.

This paper addresses the limitation of static environments in LLM-powered agent benchmarks by proposing ProEvolve, a graph-based framework for programmable environment evolution. ProEvolve represents environments as typed relational graphs, allowing for scalable and controllable evolution through graph transformations, and was used to evolve a single environment into 200 environments and 3,000 task sandboxes for benchmarking.

LLM-powered agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet, most existing benchmarks assume static environments with fixed schemas and toolsets, neglecting the evolutionary nature of the real world and agents' robustness to environmental changes. In this paper, we study a crucial problem: how to evolve the agent environment in a scalable and controllable way, thereby better evaluating agents' adaptability to real-world dynamics. We propose ProEvolve, a graph-based framework that makes environment evolution programmable. At its core, a typed relational graph provides a unified, explicit representation of the environment: data, tools, and schema. Under this formalism, adding, removing, or modifying capabilities are expressed as graph transformations that coherently propagate updates across tools, schemas, and data access. Building on this, ProEvolve can (1) program the evolutionary dynamics as graph transformations to generate environments automatically, and (2) instantiate task sandboxes via subgraph sampling and programming. We validate ProEvolve by evolving a single environment into 200 environments and 3,000 task sandboxes, and benchmark representative agents accordingly.
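To make the idea of a typed relational graph more concrete, here is a minimal sketch in Python. It is not ProEvolve's implementation: the `EnvGraph` class, the node types (`entity`, `field`, `tool`), the relation names (`has_field`, `reads`), and the helpers `add_capability`, `remove_field`, and `sample_sandbox` are all hypothetical names chosen for illustration. The sketch shows one way an add or remove transformation could propagate across schema and tools, and how a task sandbox might be drawn as a subgraph.

```python
import random
from dataclasses import dataclass, field


@dataclass
class EnvGraph:
    """Toy typed relational graph (hypothetical, not the paper's data model)."""
    nodes: dict = field(default_factory=dict)  # node name -> node type
    edges: set = field(default_factory=set)    # (source, relation, target) triples

    def add_node(self, name, kind):
        self.nodes[name] = kind

    def add_edge(self, src, rel, dst):
        assert src in self.nodes and dst in self.nodes
        self.edges.add((src, rel, dst))

    def remove_node(self, name):
        # Removing a node also removes every edge that touches it.
        del self.nodes[name]
        self.edges = {e for e in self.edges if name not in (e[0], e[2])}


def add_capability(g, entity, field_name, tool):
    """Transformation: add a schema field plus a tool that reads it."""
    g.add_node(field_name, "field")
    g.add_edge(entity, "has_field", field_name)
    g.add_node(tool, "tool")
    g.add_edge(tool, "reads", field_name)


def remove_field(g, field_name):
    """Transformation: drop a field and every tool that depends on it."""
    dependent = sorted(s for (s, r, d) in g.edges if r == "reads" and d == field_name)
    for tool in dependent:
        g.remove_node(tool)
    g.remove_node(field_name)
    return dependent


def sandbox(g, tools):
    """Induced subgraph: the given tools, the fields they read, the owning entities."""
    keep = set(tools)
    keep |= {d for (s, r, d) in g.edges if r == "reads" and s in keep}
    keep |= {s for (s, r, d) in g.edges if r == "has_field" and d in keep}
    return EnvGraph(
        {n: g.nodes[n] for n in keep},
        {e for e in g.edges if e[0] in keep and e[2] in keep},
    )


def sample_sandbox(g, k, rng):
    """Sample k tools at random and build the sandbox around them."""
    tools = sorted(n for n, kind in g.nodes.items() if kind == "tool")
    return sandbox(g, rng.sample(tools, k))


# Build a tiny environment, sample a task sandbox, then evolve the environment.
env = EnvGraph()
env.add_node("order", "entity")
add_capability(env, "order", "status", "get_order_status")
add_capability(env, "order", "eta", "get_order_eta")

task = sample_sandbox(env, k=1, rng=random.Random(0))
dropped = remove_field(env, "eta")
```

In this toy version, removing the `eta` field also removes the tool that reads it, which mirrors the abstract's point that a single transformation should keep tools, schema, and data access consistent. A real system would need richer node and edge types than the three assumed here.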
