CL, AI, LG | Aug 14, 2025

Reinforced Language Models for Sequential Decision Making

Jim Dilkes, Vahid Yazdanpanah, Sebastian Stein

arXiv:2508.10839v1 | 1 citation | h-index: 13

Originality: Incremental advance

AI Analysis

The work offers a practical alternative to scaling model size for building sequential decision-making agents, though it appears incremental because it builds on existing post-training methods.

The authors address the problem of improving smaller language models for sequential decision-making tasks by developing Multi-Step Group-Relative Policy Optimization (MS-GRPO), a post-training algorithm that handles credit assignment in multi-step tasks. Their 3-billion-parameter model outperformed a 72-billion-parameter baseline by 50% on the Frozen Lake task.

Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal Text-Mediated Stochastic Game (TMSG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute-advantage-weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post-training a 3-billion-parameter model on Snake and Frozen Lake. Our experiments demonstrate that the method is effective in improving decision-making performance: our post-trained 3B-parameter model outperforms a 72B-parameter baseline by 50% on the Frozen Lake task. This work demonstrates that targeted post-training is a practical and efficient alternative to relying on model scale for creating sequential decision-making agents using LLMs.
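The credit-assignment and sampling ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the usual GRPO-style normalization (subtract the group mean, divide by the group standard deviation), and it assumes that "absolute-advantage-weighted" sampling means drawing episodes with probability proportional to |advantage|, with replacement. The function names, the `eps` constant, and the uniform fallback are hypothetical choices made for illustration.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(episode_returns, eps=1e-8):
    """One advantage per episode, relative to the other episodes in the group.

    Assumption: mean/std normalization as in standard GRPO.
    """
    mu = mean(episode_returns)
    sigma = pstdev(episode_returns)
    return [(r - mu) / (sigma + eps) for r in episode_returns]


def per_step_advantages(episode_returns, episode_lengths):
    """Give every step of an episode that episode's whole-return advantage.

    This mirrors the abstract: the entire cumulative episode reward is
    attributed to each individual step.
    """
    advantages = group_relative_advantages(episode_returns)
    return [[a] * n for a, n in zip(advantages, episode_lengths)]


def sample_episodes(advantages, k, rng=None):
    """Draw k episode indices with probability proportional to |advantage|.

    Assumption: sampling is with replacement; if every advantage is zero,
    fall back to uniform sampling.
    """
    rng = rng or random.Random()
    weights = [abs(a) for a in advantages]
    if sum(weights) == 0:
        return [rng.randrange(len(advantages)) for _ in range(k)]
    return rng.choices(range(len(advantages)), weights=weights, k=k)


if __name__ == "__main__":
    returns = [1.0, 0.0, 0.0, 1.0]   # cumulative reward of each episode
    lengths = [3, 2, 4, 1]           # number of steps in each episode
    advantages = group_relative_advantages(returns)
    print(per_step_advantages(returns, lengths))
    print(sample_episodes(advantages, k=2, rng=random.Random(0)))
```

Under these assumptions, episodes whose return is close to the group mean are rarely selected for an update, while clearly good or clearly bad episodes dominate the training batch.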
