LG | CL | Dec 4, 2025

Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space

arXiv:2512.04601v2 | 4 citations | h-index: 7
Originality: Incremental advance
AI Analysis

This paper addresses unstable and sample-inefficient training of LLM agents on complex tasks such as web browsing and tool use, and offers a more scalable training paradigm. The advance appears incremental because it builds on existing actor-critic methods.

The paper tackles the training of large language model (LLM) agents on long-horizon tasks with sparse rewards. It proposes Natural Language Actor-Critic (NLAC), an actor-critic algorithm whose generative LLM critic produces natural language feedback instead of scalar values. The authors report more stable and data-efficient training, with promising performance improvements over existing methods.

Large language model (LLM) agents -- LLMs that dynamically interact with an environment over long horizons -- have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.
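The abstract describes the idea only at a high level, so the following is a minimal sketch of how a language-space critic could drive off-policy improvement without policy gradients. It is not the authors' implementation. Every name here (`actor_propose`, `critic_critique`, `actor_refine`, `build_targets`, `Transition`) is hypothetical, the three model calls are string-returning stubs standing in for LLM calls, and the refine-then-imitate update is an assumption about what training without policy gradients might look like.

```python
import random
from dataclasses import dataclass


@dataclass
class Transition:
    """One stored step from any past policy (off-policy data)."""
    state: str
    action: str


def actor_propose(state: str) -> str:
    # Hypothetical stand-in for sampling an action from the LLM policy.
    return f"click('search') on [{state}]"


def critic_critique(state: str, action: str) -> str:
    # Hypothetical stand-in for the generative critic: it returns text
    # explaining what is wrong with the action, not a scalar value.
    return f"'{action}' is premature: no query has been typed yet."


def actor_refine(state: str, action: str, critique: str) -> str:
    # Hypothetical stand-in for the policy revising its action in light
    # of the critique, instead of relying on random exploration.
    return f"type('query') then {action}"


def build_targets(replay_buffer, batch_size, seed=0):
    """Turn replayed states into (state, improved_action) training pairs.

    Assumption: the policy is later fine-tuned to imitate the improved
    actions with a supervised loss, so no policy gradient is needed.
    """
    rng = random.Random(seed)
    batch = rng.sample(replay_buffer, k=min(batch_size, len(replay_buffer)))
    targets = []
    for transition in batch:
        proposal = actor_propose(transition.state)
        critique = critic_critique(transition.state, proposal)
        improved = actor_refine(transition.state, proposal, critique)
        targets.append((transition.state, improved))
    return targets


buffer = [
    Transition("home page", "click('ads')"),
    Transition("results page", "scroll()"),
    Transition("login page", "wait()"),
]
targets = build_targets(buffer, batch_size=2)
for state, improved_action in targets:
    print(state, "->", improved_action)
```

In this sketch the states come from a replay buffer rather than from fresh rollouts of the current policy, which is what makes the loop off-policy. The critique text is the only training signal passed from the critic to the actor.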

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
