Can Large Language Models Generalize Procedures Across Representations?
This work addresses the challenge of adapting LLMs to real-world user tasks specified in natural language. The contribution is incremental: it builds on existing training methods and adds a new data curriculum.
The study asks whether large language models (LLMs) can generalize procedures across representations such as code, graphs, and natural language. It finds that existing training methods do not reliably generalize, but that a proposed two-stage data curriculum substantially improves performance: a 1.5B Qwen model closely matches zero-shot GPT-4o in naturalistic planning.
Large language models (LLMs) are trained and tested extensively on symbolic representations such as code and graphs, yet real-world user tasks are often specified in natural language. To what extent can LLMs generalize across these representations? Here, we approach this question by studying isomorphic tasks involving procedures represented in code, graphs, and natural language (e.g., scheduling steps in planning). We find that training LLMs with popular post-training methods on graphs or code data alone does not reliably generalize to corresponding natural language tasks, while training solely on natural language can lead to inefficient performance gains. To address this gap, we propose a two-stage data curriculum that first trains on symbolic, then natural language data. The curriculum substantially improves model performance across model families and tasks. Remarkably, a 1.5B Qwen model trained by our method can closely match zero-shot GPT-4o in naturalistic planning. Finally, our analysis suggests that successful cross-representation generalization can be interpreted as a form of generative analogy, which our curriculum effectively encourages.
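To make "isomorphic tasks" concrete, the following is a minimal sketch of one small planning procedure rendered as a graph, as code, and as natural language, with a shared check for whether a proposed step order is valid. The toy cooking task, the function names, and the `step(...)` pseudo-call in the code rendering are all hypothetical illustrations, not the paper's data, prompts, or training pipeline. The last helper only shows the ordering idea behind a two-stage curriculum (symbolic examples first, natural language second), under the same assumptions.

```python
from graphlib import TopologicalSorter

# Hypothetical toy procedure: each step maps to the steps it depends on.
DEPENDENCIES = {
    "boil water": [],
    "chop vegetables": [],
    "cook pasta": ["boil water"],
    "make sauce": ["chop vegetables"],
    "serve": ["cook pasta", "make sauce"],
}


def as_graph(deps):
    """Graph rendering: one 'prerequisite -> step' edge per dependency."""
    return [f"{pre} -> {step}" for step, pres in deps.items() for pre in pres]


def as_code(deps):
    """Code rendering: one assignment per step, prerequisites as arguments."""
    lines = []
    for step, pres in deps.items():
        name = step.replace(" ", "_")
        args = ", ".join(p.replace(" ", "_") for p in pres)
        lines.append(f"{name} = step({args})")
    return lines


def as_text(deps):
    """Natural language rendering of the same constraints."""
    sentences = []
    for step, pres in deps.items():
        if pres:
            sentences.append(f"You can {step} only after you {' and '.join(pres)}.")
        else:
            sentences.append(f"You can {step} right away.")
    return " ".join(sentences)


def is_valid_plan(plan, deps):
    """True if the plan lists every step once, each after its prerequisites."""
    position = {step: i for i, step in enumerate(plan)}
    if len(plan) != len(deps) or set(position) != set(deps):
        return False
    return all(
        position[pre] < position[step]
        for step, pres in deps.items()
        for pre in pres
    )


def two_stage_curriculum(symbolic_examples, natural_examples):
    """Order training data in two stages: symbolic first, then natural language."""
    return [
        ("stage 1: symbolic", list(symbolic_examples)),
        ("stage 2: natural language", list(natural_examples)),
    ]


# One valid schedule, computed from the shared underlying structure.
reference_plan = list(TopologicalSorter(DEPENDENCIES).static_order())

if __name__ == "__main__":
    print("\n".join(as_graph(DEPENDENCIES)))
    print("\n".join(as_code(DEPENDENCIES)))
    print(as_text(DEPENDENCIES))
    print(reference_plan, is_valid_plan(reference_plan, DEPENDENCIES))
```

All three renderings encode the same ordering constraints, so a single checker such as `is_valid_plan` can score an answer regardless of which representation the task was posed in. That shared structure is what makes cross-representation generalization measurable in a setup like this one.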