SE · AI · MA · May 10

An Executable Benchmarking Suite for Tool-Using Agents

arXiv:2605.11030 · 78.4
Predicted impact: top 17% in SE · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the need for rigorous, reproducible evaluation of tool-using agents by providing a standardized benchmarking framework that separates admitted, paper-facing evidence from preflight, fixture, smoke, and diagnostic artifacts.

The paper introduces an executable benchmarking suite that formalizes evidence admission for tool-using agents, connecting WebArena Verified, a SWE-Gym slice, and MiniWoB++ under a shared contract. The suite enables auditable comparisons, as demonstrated by a WebArena Verified controller study in which clean-baseline and medium live-stressed evaluation select different fixed controller variants.

Closed-loop tool-using agents are increasingly evaluated in executable web, code, and micro-task environments, but benchmark reports often conflate workloads, action-generating drivers, and the evidence admitted for systems-facing claims. We present an executable benchmarking suite that makes these objects explicit under a shared evidence-admission contract. The suite connects WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++ through common workload adapters, task manifests, event schemas, replay/freeze policy, declared drivers, and reporting pipelines. In the canonical release, the gate separates paper-facing evidence from preflight, fixture, smoke, and diagnostic rows while preserving non-admitted artifacts for audit and onboarding. The admitted evidence records latency, invalid-action behavior, patch-generation cost, verifier metadata, replay bindings, and provenance under one auditable contract. The gate is decision-relevant rather than merely clerical: in a separate WebArena Verified controller study, clean-baseline and medium live-stressed evaluation select different fixed controller variants under the same workload and admission contract. The release is scoped as a benchmarking suite and admitted evidence, not a new agent policy, model leaderboard, backend comparison, or autonomous SWE-bench solver.
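The admission gate described in the abstract can be pictured with a minimal sketch. This is not the suite's actual implementation: the `Row` fields, the `"evidence"` label, the `admit` function, and the sample rows are all hypothetical, chosen only to show how a gate might admit paper-facing rows while retaining preflight, fixture, smoke, and diagnostic rows for audit.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical label for rows that may back paper-facing claims.
# The abstract names the non-admitted kinds (preflight, fixture, smoke,
# diagnostic) but does not say how admitted rows are labelled.
ADMITTED_KIND = "evidence"


@dataclass(frozen=True)
class Row:
    """One recorded run. Field names are illustrative assumptions."""
    run_id: str
    workload: str      # e.g. "webarena-verified", "swe-gym", "miniwob++"
    kind: str          # "evidence", "preflight", "fixture", "smoke", "diagnostic"
    latency_ms: float


def admit(rows: Iterable[Row]) -> Tuple[List[Row], List[Row]]:
    """Split rows into (admitted, retained).

    Allow-list gate: only rows explicitly marked as evidence are admitted.
    Every other row is kept in `retained` rather than dropped, so it stays
    available for audit and onboarding.
    """
    admitted: List[Row] = []
    retained: List[Row] = []
    for row in rows:
        if row.kind == ADMITTED_KIND:
            admitted.append(row)
        else:
            retained.append(row)
    return admitted, retained


# Made-up sample rows for illustration only; not data from the paper.
rows = [
    Row("r1", "webarena-verified", "smoke", 120.0),
    Row("r2", "webarena-verified", "evidence", 340.0),
    Row("r3", "miniwob++", "diagnostic", 15.0),
    Row("r4", "swe-gym", "preflight", 900.0),
]

admitted, retained = admit(rows)
print([r.run_id for r in admitted])
print([r.run_id for r in retained])
```

An allow-list is used here so that an unrecognized row kind is retained for audit by default instead of silently entering the reported evidence; whether the suite makes the same choice is not stated in the abstract.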
