AI · LG · Jun 9

Search Discipline for Long-Horizon Research Agents

arXiv:2606.11522v18.1 · h-index: 2

Predicted impact: top 77% in AI · last 90 days · Originality: Highly original

AI Analysis

For researchers using automated agents to select scientific candidates, the paper reveals a critical flaw in aggregate metrics and offers a corrective protocol.

The paper identifies a failure mode where aggregate metrics rank the wrong scientific candidate first when validity is multi-dimensional, and proposes a search-discipline protocol that audits candidates on disaggregated behavior rather than the headline score.

Autoresearch agents now propose, evaluate, and select scientific candidates against a metric, and that metric is usually an aggregate reduced over a heterogeneous space of regions, slices, or cohorts. We show that when scientific validity lives in that disaggregated structure, the aggregate can rank the wrong candidate first. The headline number improves while the structure underneath inverts, so a decision made on the number accepts a candidate that quietly breaks the model. The failure is not domain-specific. It appears wherever a candidate's validity is multi-dimensional but its verifier is a single reduction. We demonstrate the inversion on a fire-model task in the Ecosystem Demography model. The highest-scoring candidate and a slightly lower one are within noise of each other on global score, yet the top-scoring one collapses the protected boreal regions while the other preserves them. What separates them is the per-region behavior, not the headline number. This decision should not be left to the agent that produced the candidates. The agent optimizing the score is the last party likely to catch the score being wrong, and a prompt has no remaining turn once the agent has stopped. We move the decision to an external control loop that audits each candidate on its disaggregated behavior and acts after the agent has decided. It can demote a candidate the agent would have accepted, and it can reopen a run the agent had declared finished. Our contribution is the inversion finding itself, and a search-discipline protocol that decides on reviewable candidate-effect evidence instead of the score.
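
The inversion and the external audit described in the abstract can be illustrated with a small sketch. This is not the paper's protocol or data: the region names, weights, baseline, tolerance, and scores below are all hypothetical, chosen only so that two candidates sit close together on the aggregate while one of them collapses a protected region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    region_scores: dict  # region -> skill score, higher is better (hypothetical)


# All values below are illustrative assumptions, not taken from the paper.
WEIGHTS = {"tropics": 0.5, "temperate": 0.3, "boreal": 0.2}      # aggregate weights
PROTECTED = {"boreal"}                                           # regions that must not degrade
BASELINE = {"tropics": 0.60, "temperate": 0.60, "boreal": 0.70}  # reference model scores
MAX_DROP = 0.05                                                  # tolerated drop below baseline


def aggregate(c: Candidate) -> float:
    """Headline score: a single weighted reduction over regions."""
    return sum(WEIGHTS[r] * s for r, s in c.region_scores.items())


def audit(c: Candidate) -> list:
    """Protected regions where the candidate falls more than MAX_DROP below baseline."""
    return sorted(r for r in PROTECTED if BASELINE[r] - c.region_scores[r] > MAX_DROP)


def select(candidates):
    """External control step: walk candidates in aggregate order and demote
    any that fail the per-region audit. Returning None means no candidate
    passed, which a caller could treat as a signal to reopen the run."""
    for c in sorted(candidates, key=aggregate, reverse=True):
        if not audit(c):
            return c
    return None


a = Candidate("A", {"tropics": 0.82, "temperate": 0.72, "boreal": 0.30})
b = Candidate("B", {"tropics": 0.70, "temperate": 0.65, "boreal": 0.68})

by_score = max([a, b], key=aggregate)  # what a score-only decision would pick
by_audit = select([a, b])              # what the audited decision picks
print(by_score.name, audit(a), by_audit.name)
```

In this toy setup candidate A leads on the aggregate by a small margin, yet the audit flags its boreal score and the selection falls to candidate B. The audit runs outside the code that produced the scores, which mirrors the abstract's point that the decision should not rest with the agent optimizing the headline number.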
