AIMay 11

OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

arXiv:2605.1116998.1
Predicted impact top 5% in AI · last 90 daysOriginality Incremental advance
AI Analysis

For developers deploying LLM agents on repeated multi-step tasks, OLIVIA reduces error accumulation and wasted tool calls via explicit, trackable action-level adaptation.

OLIVIA introduces an online learning layer for ReAct-style LLM agents that models action selection as a contextual linear bandit, enabling lightweight, uncertainty-aware adaptation during deployment. Across four benchmarks, it consistently outperforms static ReAct and prompt-based baselines in task performance.

Large language model agents interleave reasoning, action selection, and observation to solve sequential decision-making tasks. In deployed settings where agents repeatedly handle related multi-step tasks, small action-selection errors can accumulate into wasted tool calls, latency, and reduced reliability. Despite this need for deployment-time improvement, existing inference-time adaptation methods for LLM agents mainly rely on prompting or retrieval, which influence behavior indirectly through context manipulation. For ReAct-style agents, such approaches do not expose an explicit decision layer that can score candidate actions, represent uncertainty, or be updated online from action-level feedback. As a result, they provide limited support for trackable, fine-grained, and uncertainty-aware adaptation during deployment. We propose OLIVIA, an inference-time action adaptation framework for ReAct-style agents. OLIVIA models the LLM's final action-selection layer as a contextual linear bandit over candidate actions, with frozen hidden states as decision contexts. This choice is particularly suitable for deployment because it adapts behavior directly at the action-selection interface, preserves the underlying reasoning process, and provides explicit uncertainty estimates and lightweight online updates from action-level feedback. With upper-confidence-bound exploration, OLIVIA improves the policy sample-efficiently with minimal computational overhead. We instantiate OLIVIA on four benchmarks and show that it consistently improves task performance over static ReAct and prompt-based inference-time baselines. Our results suggest that explicit online decision layers provide an effective alternative to purely prompt- or retrieval-based adaptation for LLM agents during deployment.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes