LG AI May 9

Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning

arXiv:2605.09009
53.5
AI Analysis

For researchers and practitioners applying LLMs to sequential decision-making problems where offline data is abundant (e.g., healthcare), this work provides a practical method to improve performance via supervised fine-tuning.

This paper shows that supervised fine-tuning of large language models on offline, oracle-labeled trajectories significantly improves their in-context learning for sequential decision-making in MDPs, POMDPs, and ambiguous POMDPs, achieving substantially smaller optimality gaps than in-context-only and random baselines, especially in longer-horizon and partially observed settings.

Large language models (LLMs) have shown remarkable in-context learning (ICL) capabilities, yet their potential for sequential decision-making remains underexplored. In this paper, we study the ICL capabilities of LLMs in sequential decision-making settings, including Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs), and Ambiguous POMDPs (APOMDPs). We fine-tune pretrained LLMs to perform few-shot decision-making directly from offline, oracle-labeled trajectories. Our framework enables flexible imitation of policies through supervised fine-tuning (SFT). Theoretically, we focus on linear MDPs and interpret a fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data. Building on this interpretation, we derive an end-to-end suboptimality bound for the induced policy that separates the in-context estimation error from the training-length bias. Empirically, across synthetic MDP, POMDP, and APOMDP settings, we find that fine-tuned LLMs achieve substantially smaller optimality gaps than in-context-only and random baselines, with especially large gains in longer-horizon, partially observed, and model-ambiguous environments. Together, these results show that supervised fine-tuning provides an effective route to endowing pretrained LLMs with sequential decision-making capabilities from offline data, which is an important advantage in domains such as healthcare where offline data are abundant.
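The abstract compares policies by their optimality gap. As a rough illustration only, the sketch below computes one common version of that metric, the difference between the optimal value and a policy's value at a start state, on a small random tabular MDP with a finite horizon. The abstract does not give the paper's exact definition or environments, so everything here is an assumption for illustration: the helper names (`make_mdp`, `optimal_values`, `evaluate`, `optimality_gap`) are hypothetical, the MDP is synthetic, and the oracle is plain backward induction standing in for whatever labels the trajectories in the paper.

```python
import random

def make_mdp(n_states, n_actions, seed=0):
    """Random tabular MDP (illustrative): P[s][a] is a next-state distribution, R[s][a] a reward in [0, 1]."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        P_s, R_s = [], []
        for _ in range(n_actions):
            w = [rng.random() for _ in range(n_states)]
            total = sum(w)
            P_s.append([x / total for x in w])
            R_s.append(rng.random())
        P.append(P_s)
        R.append(R_s)
    return P, R

def q_values(P, R, V_next):
    """One-step lookahead: Q[s][a] = R[s][a] + sum over s' of P[s][a][s'] * V_next[s']."""
    return [[R[s][a] + sum(p * v for p, v in zip(P[s][a], V_next))
             for a in range(len(R[s]))] for s in range(len(R))]

def optimal_values(P, R, horizon):
    """Finite-horizon backward induction; returns V* at step 0 and the oracle action per (step, state)."""
    V = [0.0] * len(R)
    policy = []
    for _ in range(horizon):
        Q = q_values(P, R, V)
        policy.append([max(range(len(q)), key=q.__getitem__) for q in Q])
        V = [max(q) for q in Q]
    policy.reverse()  # after reversing, policy[t][s] is the oracle action at step t
    return V, policy

def evaluate(P, R, horizon, action_probs):
    """Value at step 0 of a policy; action_probs(t, s) returns a distribution over actions."""
    V = [0.0] * len(R)
    for t in reversed(range(horizon)):
        Q = q_values(P, R, V)
        V = [sum(pa * q for pa, q in zip(action_probs(t, s), Q[s]))
             for s in range(len(R))]
    return V

def optimality_gap(P, R, horizon, action_probs, s0=0):
    """Gap = V*(s0) - V^pi(s0); zero for an optimal policy, nonnegative otherwise."""
    V_star, _ = optimal_values(P, R, horizon)
    V_pi = evaluate(P, R, horizon, action_probs)
    return V_star[s0] - V_pi[s0]

N_ACTIONS, H = 3, 10
P, R = make_mdp(n_states=5, n_actions=N_ACTIONS)
_, oracle = optimal_values(P, R, H)

# Random baseline: uniform over actions. Oracle: deterministic greedy action.
uniform = lambda t, s: [1.0 / N_ACTIONS] * N_ACTIONS
oracle_probs = lambda t, s: [1.0 if a == oracle[t][s] else 0.0 for a in range(N_ACTIONS)]

gap_random = optimality_gap(P, R, H, uniform)
gap_oracle = optimality_gap(P, R, H, oracle_probs)
print("random baseline gap:", gap_random)
print("oracle gap:", gap_oracle)
```

In this framing, a fine-tuned or in-context-only LLM policy would be plugged in as another `action_probs` function, and its gap would be expected to fall somewhere between the oracle's and the random baseline's.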

