AI · CE · MA · Apr 28

Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

arXiv:2604.26091 · 66.8
Predicted impact: top 57% in AI · last 90 days
Originality: Incremental advance
AI Analysis

For developers of capital-managing autonomous agents, this work demonstrates that reliability requires an operating layer beyond the base model, with concrete improvements from targeted harness changes.

The paper presents a 21-day deployment of autonomous language-model agents managing real ETH in a bounded onchain market. The system produced 7.5M agent invocations and ~$20M in volume, with 99.9% settlement success for policy-valid submitted transactions. Reliability emerged from the operating layer (prompt compilation, policy validation, execution guards) rather than the base model alone.

We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.
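
To make the idea of typed controls and policy validation concrete, the following is a minimal sketch, not the paper's implementation. It assumes a vault exposes a few user-set limits and that the model's tool call has already been parsed into a structured action. Every name here (`VaultControls`, `ProposedAction`, `validate`, the specific limits, and the token labels) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class VaultControls:
    """Hypothetical typed controls a user might set on a vault."""
    allowed_tokens: frozenset
    max_trade_eth: Decimal
    allow_sells: bool = True


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical structured action parsed from the model's tool call."""
    side: str  # "buy" or "sell"
    token: str
    amount_eth: Decimal


def validate(action, controls, balance_eth):
    """Return a list of policy violations; an empty list means policy-valid."""
    violations = []
    if action.side not in ("buy", "sell"):
        violations.append("unknown side")
    if action.token not in controls.allowed_tokens:
        violations.append("token not allowed")
    if action.amount_eth <= 0:
        violations.append("non-positive amount")
    if action.amount_eth > controls.max_trade_eth:
        violations.append("exceeds max trade size")
    if action.side == "buy" and action.amount_eth > balance_eth:
        violations.append("insufficient balance")
    if action.side == "sell" and not controls.allow_sells:
        violations.append("sells disabled")
    return violations


# Illustrative values only; none of these come from the deployment.
controls = VaultControls(frozenset({"TOKEN_A"}), Decimal("0.5"))
ok = validate(ProposedAction("buy", "TOKEN_A", Decimal("0.2")), controls, Decimal("1"))
bad = validate(ProposedAction("buy", "TOKEN_B", Decimal("2")), controls, Decimal("1"))
print(ok)
print(bad)
```

In a design like this, only actions with no violations would be submitted for settlement, which is one way to read the abstract's "policy-valid submitted transactions". Returning every violation rather than stopping at the first keeps the rejection reasons available for trace-level inspection.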

