AI · LG · Jan 20

On the Generalization Gap in LLM Planning: Tests and Verifier-Reward RL

Valerio Belcamino, Nicholas Attolino, Alessio Capitanelli, Fulvio Mastrogiovanni

arXiv:2601.14456v12.4

Originality: Incremental advance

AI Analysis

This work examines the generalization gap in LLM-based planning and is aimed at AI researchers. Its contribution is rated incremental: it exposes limitations of current fine-tuning approaches.

The study asks whether fine-tuned LLMs acquire transferable planning competence or rely on domain-specific memorization. In-domain valid plan rates reached 82.9%, but performance on the two unseen domains collapsed to 0%, which points to reliance on domain-specific patterns.

Recent work shows that fine-tuned Large Language Models (LLMs) can achieve high valid plan rates on PDDL planning tasks. However, it remains unclear whether this reflects transferable planning competence or domain-specific memorization. In this work, we fine-tune a 1.7B-parameter LLM on 40,000 domain-problem-plan tuples from 10 IPC 2023 domains, and evaluate both in-domain and cross-domain generalization. While the model reaches an 82.9% valid plan rate in-domain, it achieves 0% on two unseen domains. To analyze this failure, we introduce three diagnostic interventions, namely (i) instance-wise symbol anonymization, (ii) compact plan serialization, and (iii) verifier-reward fine-tuning using the VAL validator as a success-focused reinforcement signal. Symbol anonymization and compact serialization cause significant performance drops despite preserving plan semantics, thus revealing strong sensitivity to surface representations. Verifier-reward fine-tuning reaches performance saturation in half the supervised training epochs, but does not improve cross-domain generalization. For the explored configurations, in-domain performance plateaus around 80%, while cross-domain performance collapses, suggesting that our fine-tuned model relies heavily on domain-specific patterns rather than transferable planning competence in this setting. Our results highlight a persistent generalization gap in LLM-based planning and provide diagnostic tools for studying its causes.
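
To make the first diagnostic intervention concrete, here is a minimal Python sketch of what instance-wise symbol anonymization could look like. It is an illustration under stated assumptions, not the authors' implementation: the function names `rename` and `anonymize_symbols`, the `sym<N>` token scheme, and the Blocksworld-style plan are all hypothetical. The idea is that every symbol is replaced by an opaque token, with a fresh mapping per instance, so the plan's structure survives while its surface names do not.

```python
import random
import re


def rename(text, mapping):
    """Replace whole symbols in `text` according to `mapping` in a single pass."""
    # Longest-first alternation; the lookarounds keep a short symbol such as
    # "a" from matching inside a longer one such as "stack" or "pick-up".
    alternatives = "|".join(
        re.escape(s) for s in sorted(mapping, key=len, reverse=True)
    )
    pattern = re.compile(r"(?<![\w-])(" + alternatives + r")(?![\w-])")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


def anonymize_symbols(text, symbols, seed=None):
    """Map each symbol to an opaque token (hypothetical `sym<N>` scheme).

    A new shuffled mapping is drawn per call, i.e. per instance.
    Assumes no original symbol already has the form `sym<N>`.
    """
    rng = random.Random(seed)
    ids = list(range(len(symbols)))
    rng.shuffle(ids)
    mapping = {s: f"sym{i}" for s, i in zip(symbols, ids)}
    return rename(text, mapping), mapping


# Hypothetical Blocksworld-style plan, used only for illustration.
plan = "(pick-up a)\n(stack a b)"
symbols = ["pick-up", "stack", "a", "b"]

anon_plan, mapping = anonymize_symbols(plan, symbols, seed=0)

# The mapping is a bijection, so the original plan is recoverable.
inverse = {token: symbol for symbol, token in mapping.items()}
restored = rename(anon_plan, inverse)
```

Because the mapping is invertible, the anonymized plan carries the same semantics as the original. A model that had learned transferable planning structure should therefore be largely unaffected by this renaming, which is why the performance drop reported in the abstract is read as sensitivity to surface representation.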
