When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems
For developers of black-box generate-verify AI workflows, the paper provides a statistically rigorous rule for deciding when to stop iterating and release, without requiring likelihood models or exchangeability assumptions.
The paper addresses the problem of when to stop and release results from iterative LLM-based workflows, proposing an always-valid release wrapper that controls the probability of releasing on infeasible tasks while still releasing on feasible ones. In an MBPP+ coding-agent case study, the wrapper reduces premature incorrect releases compared to baseline stopping rules.
LLM-enabled AI workflows increasingly produce outputs through iterative generate-evaluate-revise loops. Each iteration can improve the candidate, but it also creates a release decision: when to stop and output the current result? This raises a statistical challenge because deployment-time evaluator scores are adaptively generated and repeatedly monitored, yet the likelihood models or exchangeability assumptions typically used for calibration are unavailable. We propose an always-valid release wrapper for existing generator-evaluator pipelines. The wrapper builds a hard-negative reference pool of high-scoring failures, calibrates deployment-time evaluator scores against this pool, and accumulates the resulting evidence with an e-process. This separates two roles: the reference pool turns black-box scores into conservative evidence, while the e-process provides validity under optional stopping. In theory, we show that a conservative reference pool yields finite-sample control of the probability of releasing on infeasible tasks, that is, tasks for which the given workflow is not capable of producing a reliable solution. We also characterize conditions under which the same conservative rule still achieves nontrivial release on feasible tasks. In an MBPP+ coding-agent case study, the wrapper reduces premature incorrect release relative to baseline stopping rules while still releasing on tasks for which the workflow repeatedly accumulates moderate supporting evidence.
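The calibrate-then-accumulate structure described above can be illustrated with a minimal sketch. It is not the paper's construction, which the abstract does not specify. It assumes a rank-based (conformal-style) p-value against the hard-negative pool, the standard calibrator e = kappa * p^(kappa - 1) for turning a p-value into an e-value, a running product as the evidence process, and release once the product reaches 1/alpha (the threshold suggested by Ville's inequality). The names `conformal_p_value`, `p_to_e`, `release_step`, and the parameters `alpha` and `kappa` are hypothetical.

```python
from typing import Iterable, Optional, Sequence


def conformal_p_value(score: float, pool: Sequence[float]) -> float:
    """Rank of a deployment-time score among hard-negative reference scores.

    Small values mean the score exceeds almost every high-scoring failure
    in the pool. Assumption: a conservative pool makes this p-value
    super-uniform on infeasible tasks.
    """
    n_at_least = sum(1 for s in pool if s >= score)
    return (1 + n_at_least) / (len(pool) + 1)


def p_to_e(p: float, kappa: float = 0.5) -> float:
    """Standard p-to-e calibrator; it integrates to 1 over p in (0, 1]."""
    return kappa * p ** (kappa - 1.0)


def release_step(
    scores: Iterable[float],
    pool: Sequence[float],
    alpha: float = 0.05,
    kappa: float = 0.5,
) -> Optional[int]:
    """Return the first iteration (1-indexed) at which to release, or None.

    The running product of e-values is the accumulated evidence. Stopping
    when it first reaches 1/alpha is the illustrative release rule.
    """
    wealth = 1.0
    for t, score in enumerate(scores, start=1):
        wealth *= p_to_e(conformal_p_value(score, pool), kappa)
        if wealth >= 1.0 / alpha:
            return t
    return None


# Hypothetical reference pool: evaluator scores of 99 known failures.
pool = [i / 100 for i in range(1, 100)]

# Scores above every reference failure accumulate evidence quickly.
print(release_step([0.995, 0.995, 0.995], pool))

# Scores typical of the failure pool never trigger a release.
print(release_step([0.5] * 50, pool))
```

In this sketch a single high score is not enough to release: each iteration contributes a bounded multiplicative factor, so the rule waits until supporting evidence has accumulated across iterations. The finite-sample guarantee stated in the abstract depends on the paper's conditions on the reference pool, which this toy example does not verify.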