Same Signal, Opposite Meaning: Direction-Informed Adaptive Learning for LLM Agents
For LLM agent deployment, this work addresses the instability of gating signals across heterogeneous settings, improving adaptive compute efficiency.
Existing adaptive compute methods for LLM agents assume a fixed direction between gating signals and compute utility, but the same signal can predict benefit in one setting and harm in another. DIAL learns the utility direction per (environment, backbone) via counterfactual exploration, achieving a stronger success-cost trade-off across six environments and three backbones.
Adaptive test-time compute for LLM agents aims to invoke extra computation only when it improves performance. Existing methods typically use confidence-, uncertainty-, or difficulty-based gates, assuming a fixed direction that links the gating signal to compute need and, in turn, to the value of extra computation. Gating is thus a utility-calibration problem: a gating signal should track whether extra computation improves the final outcome over the base policy. We show that this alignment is unstable: the same signal predicts rollout benefit in one setting and rollout harm in another, with reversals across environments and backbones even when the task is fixed. A wrong-direction gate can therefore worsen performance by selecting precisely the states where rollouts are harmful. This reversal reflects a deeper distinction between compute need and compute suitability: a high uncertainty signal may indicate decision-difficult states, where rollouts help compare alternatives, or intervention-unsuitable states, where the current context does not support useful rollout-based improvement. Under this two-source model, fixed-direction gates are unreliable across heterogeneous settings. To address this, we propose DIAL (Direction-Informed Adaptive Learning), a sparse gate trained from signal-agnostic counterfactual exploration to learn the utility direction of state features per (environment, backbone). Across six environments and three backbones, DIAL yields a stronger overall success-cost trade-off than fixed-direction baselines.
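
The direction reversal described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes synthetic counterfactual labels (1 if a rollout improved the outcome over the base policy, 0 otherwise), a hypothetical three-feature state vector whose first feature plays the role of an uncertainty signal, and an L1-regularised logistic gate fitted by proximal gradient descent as a stand-in for "a sparse gate". All names (`simulate_setting`, `fit_sparse_gate`, `should_rollout`, `setting_A`, `setting_B`) are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate_setting(rng, direction, n=2000):
    """Synthetic stand-in for counterfactual exploration in one setting.

    Feature 0 is a hypothetical uncertainty signal; features 1-2 are noise.
    `direction` (+1 or -1) controls whether high uncertainty predicts
    rollout benefit or rollout harm in this setting.
    """
    X = rng.normal(size=(n, 3))
    p_benefit = sigmoid(direction * 2.0 * X[:, 0])
    y = (rng.random(n) < p_benefit).astype(float)
    return X, y


def fit_sparse_gate(X, y, l1=0.01, lr=0.1, steps=2000):
    """L1-regularised logistic regression via proximal gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        err = sigmoid(X @ w + b) - y
        w = w - lr * (X.T @ err) / n
        # Soft-thresholding step: shrinks small weights towards zero.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)
        b = b - lr * err.mean()
    return w, b


def should_rollout(gate, x, threshold=0.5):
    """Invoke extra compute only if the predicted benefit exceeds threshold."""
    w, b = gate
    return bool(sigmoid(x @ w + b) > threshold)


rng = np.random.default_rng(0)
gates = {}
for name, direction in [("setting_A", +1.0), ("setting_B", -1.0)]:
    X, y = simulate_setting(rng, direction)
    gates[name] = fit_sparse_gate(X, y)  # one gate per setting

high_uncertainty_state = np.array([2.0, 0.0, 0.0])
for name, gate in gates.items():
    print(name, "uncertainty weight sign:", np.sign(gate[0][0]),
          "rollout?", should_rollout(gate, high_uncertainty_state))
```

In this toy setup the two fitted gates assign opposite signs to the same uncertainty feature, so a single fixed-direction rule ("roll out when uncertain") would be correct in one setting and systematically wrong in the other. Learning the sign from outcome labels per setting avoids committing to either direction in advance.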