A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners
For researchers studying how LLMs represent knowledge in planning tasks, this work provides a recipe for interpretability analysis and insights into world model recovery.
The paper investigates whether supervised fine-tuning (SFT) on planning tasks enables LLMs to recover the underlying world model. Through interpretability experiments, the authors find that SFT on valid action sequences allows LLMs to linearly encode action validity and some state predicates, and that broader state space coverage improves world model recovery.
Supervised fine-tuning (SFT) improves end-to-end classical planning in large language models (LLMs), but do these models also learn to represent and reason about the planning problems they are solving? Due to the relative complexity of classical planning problems and the challenge that end-to-end plan generation poses for LLMs, it has been difficult to explore this question. In our work, we devise and perform a series of interpretability experiments that holistically interrogate world model recovery by examining both internal representations and generative capabilities of fine-tuned LLMs. We find that: a) Supervised fine-tuning on valid action sequences enables LLMs to linearly encode action validity and some state predicates. b) Models that struggle to use output probabilities for classifying action validity may still learn internal representations that separate valid from invalid actions. c) Broader state space coverage during fine-tuning, such as from random walk data, yields more accurate recovery of the underlying world model. In summary, this work contributes a recipe for applying interpretability techniques to planning LLMs and generates insights that shed light on open questions about how knowledge is represented in LLMs.
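To make "linearly encode action validity" concrete, the following is a minimal sketch of a linear probe, not the authors' code. It assumes that hidden states are available as fixed-length vectors labeled valid or invalid. Because no model activations are given here, the sketch substitutes synthetic vectors in which validity shifts the activation along one hidden direction. All function names, dimensions, and hyperparameters are hypothetical.

```python
import math
import random


def unit_direction(dim, rng):
    """Random unit vector: the (assumed) direction that encodes validity."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def synthetic_activations(n, direction, rng, shift=2.0):
    """Synthetic stand-in for LLM hidden states (an assumption, not real data).

    Valid actions (label 1) are shifted by +shift along `direction`,
    invalid actions (label 0) by -shift, on top of Gaussian noise.
    """
    X, y = [], []
    for i in range(n):
        label = i % 2
        sign = shift if label == 1 else -shift
        X.append([rng.gauss(0.0, 1.0) + sign * d for d in direction])
        y.append(label)
    return X, y


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(X, y, lr=0.5, epochs=200):
    """Fit logistic regression (a linear probe) with full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of examples whose sign of the linear score matches the label."""
    correct = 0
    for x, t in zip(X, y):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == t)
    return correct / len(X)


if __name__ == "__main__":
    rng = random.Random(0)
    direction = unit_direction(16, rng)
    X_train, y_train = synthetic_activations(400, direction, rng)
    X_test, y_test = synthetic_activations(200, direction, rng)
    w, b = train_linear_probe(X_train, y_train)
    print("held-out probe accuracy:", probe_accuracy(w, b, X_test, y_test))
```

In a real analysis, the synthetic vectors would be replaced by activations extracted from a chosen layer of the fine-tuned model. High held-out probe accuracy would then be evidence that validity is linearly decodable at that layer. It would not by itself show that the model uses that information when generating plans.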