AIMay 23

DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations

arXiv:2605.2453964.5
AI Analysis

For researchers improving frozen language-model agents via harness evolution, DemoEvolve addresses the bottleneck of sparse feedback in long-horizon stochastic tasks, offering a more stable and diagnosable approach.

DemoEvolve uses human demonstrations to guide harness evolution for frozen language agents, overcoming sparse feedback in long-horizon stochastic environments. On Balatro, it outperforms self-rollout evolution and tutorial-like knowledge, producing more effective and auditable edits.

Agent harness evolution improves frozen language-model agents by modifying the executable structures around them. We study this paradigm as a form of sample-efficient fast adaptation: instead of updating model weights, an agent can acquire task-specific competence by changing its external harness, while leaving the base model's general capabilities intact. Prior work shows that self-generated rollouts can support harness search, suggesting that agents may acquire new task competence through practice. Yet in long-horizon stochastic environments, self-practice becomes fragile: rewards are sparse, outcomes are high-variance, and failures are hard to attribute to concrete harness mechanisms. We introduce DemoEvolve, a demonstration-bootstrapped approach to harness evolution. When reward-only search is too broad and noisy, competent human trajectories serve as expert reference experience for the coding proposer, guiding harness-level diagnosis and editing. Experiments on Liar's Dice show that self-rollout evolution can work when episodes are short and failures are attributable. In contrast, Balatro exposes a harder long-horizon stochastic regime, where self-rollout evolution is misled by sparse feedback and candidate-selection noise, while tutorial-like textual knowledge alone does not yield stable improvement. Under the same limited budget, DemoEvolve produces more effective and auditable harness edits and achieves better performance. Overall, demonstrations make sparse-feedback harness evolution more diagnosable, localizable, and stable.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes