SE AI | May 7

BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases

arXiv:2605.06136

AI Analysis

For researchers evaluating agent-managed codebase engineering, this protocol provides a way to assess how well repositories communicate intent to future agents, beyond behavioral correctness.

The paper introduces BUILD-AND-FIND, a protocol for evaluating whether downstream agents can recover intended design choices from agent-generated codebases. It reports recovery accuracy, repeatability, implementation coverage, and inspection effort. In the released high-prior task pack, recovery accuracy is near saturation, so inspection effort and finder-specific effects become the main basis for comparison.

Most coding-agent benchmarks ask whether generated code behaves correctly. That remains essential, but repository-level engineering is increasingly agent-managed: one agent writes a repository, and later agents inspect, audit, or extend it as working context. In that setting, a generated repository is not only an answer to a task but also a communication artifact for future work. Even when strong agents nearly satisfy the visible behavioral objective, repositories can differ in how clearly they expose the intended behavior and design choices behind that behavior. We introduce BUILD-AND-FIND, a protocol for evaluating whether downstream agents can recover those intended choices from generated repositories, and how much inspection that recovery requires. For each task, a builder sees a hidden repository specification and creates a codebase; a finder sees only the codebase and a specification-traced multiple-choice question bank. The protocol separates behavioral correctness from artifact-side recovery and reports recovery accuracy, repeatability, implementation coverage, and inspection effort. Accuracy and stability act as gates: effort is interpreted only when recovery succeeds reliably. Among artifacts from which the same intent can be recovered, lower effort by the same finder suggests that the artifact makes that intent easier to locate. Question-only and spec-only controls quantify generic priors and specification access, while audits separate omitted claims from finder failures and check whether correct answers cite artifact evidence. In the released high-prior task pack, recovery accuracy is near saturation, so inspection effort and finder-specific effects provide the main panel-local comparison.
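The gating rule in the abstract (effort is interpreted only when recovery succeeds reliably) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `FinderRun` record, the use of `inspection_steps` as an effort proxy, the definition of repeatability as the share of questions answered identically across runs, and the 0.9 thresholds are all assumptions made for illustration. Implementation coverage and the question-only and spec-only controls are left out.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class FinderRun:
    """One finder pass over a generated repository (hypothetical record)."""
    answers: dict[str, str]  # question id -> chosen option
    inspection_steps: int    # assumed effort proxy, e.g. files opened or tool calls


def accuracy(run: FinderRun, key: dict[str, str]) -> float:
    """Fraction of questions answered as the specification-traced key says."""
    return sum(run.answers.get(q) == a for q, a in key.items()) / len(key)


def repeatability(runs: list[FinderRun], key: dict[str, str]) -> float:
    """Fraction of questions on which every run gave the same answer."""
    return sum(len({r.answers.get(q) for r in runs}) == 1 for q in key) / len(key)


def gated_effort(
    runs: list[FinderRun],
    key: dict[str, str],
    min_accuracy: float = 0.9,       # assumed threshold, not from the paper
    min_repeatability: float = 0.9,  # assumed threshold, not from the paper
) -> Optional[float]:
    """Return mean inspection effort, or None when a gate fails."""
    acc = mean(accuracy(r, key) for r in runs)
    rep = repeatability(runs, key)
    if acc < min_accuracy or rep < min_repeatability:
        return None  # recovery is not reliable, so effort is not interpreted
    return mean(r.inspection_steps for r in runs)


key = {"q1": "A", "q2": "B"}
reliable = [
    FinderRun({"q1": "A", "q2": "B"}, inspection_steps=4),
    FinderRun({"q1": "A", "q2": "B"}, inspection_steps=6),
]
unreliable = [
    FinderRun({"q1": "A", "q2": "C"}, inspection_steps=2),
    FinderRun({"q1": "A", "q2": "B"}, inspection_steps=3),
]
effort_reliable = gated_effort(reliable, key)
effort_unreliable = gated_effort(unreliable, key)
```

In this sketch the second repository has the lower raw step count, but its effort is never reported because the accuracy gate fails. That mirrors the point that effort is only compared among artifacts from which the same finder recovers the same intent.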
