CL | AI | Feb 18, 2025

Text2World: Benchmarking Large Language Models for Symbolic World Model Generation

arXiv:2502.13092v2 | 21 citations | h-index: 20 | ACL
Originality: Incremental advance
AI Analysis

This work addresses evaluation challenges in LLM-based world modeling for AI planning and simulation research. The advance is incremental: it contributes a new benchmark built on existing methods.

The paper addresses the problem of evaluating large language models (LLMs) that generate symbolic world models from text. It introduces a new benchmark, Text2World, which uses PDDL domains and execution-based metrics. It finds that reasoning models trained with reinforcement learning perform best but still have limited capabilities.

Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on the Planning Domain Definition Language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test-time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models. The project page is available at https://text-to-world.github.io/.

