CL, AI, LG · Jul 3, 2024

Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael L. Littman, Stephen H. Bach

arXiv:2407.03321v217.331 citations · h-index: 9 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for a rigorous benchmark in planning and AI research, though it is incremental in that it builds on existing evaluation methods.

The authors tackle the problem of evaluating how well language models translate natural language into structured planning languages such as PDDL. They introduce Planetarium, a benchmark comprising a dataset of 145,918 text-to-PDDL pairs and a novel equivalence algorithm, and find that only 24.8% of GPT-4o-generated PDDL problem descriptions are semantically correct.

Recent works have explored using language models for planning problems. One approach examines translating natural language descriptions of planning tasks into structured planning languages, such as the planning domain definition language (PDDL). Existing evaluation methods struggle to ensure semantic correctness and rely on simple or unrealistic datasets. To bridge this gap, we introduce Planetarium, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. Planetarium features a novel PDDL equivalence algorithm that flexibly evaluates the correctness of generated PDDL, along with a dataset of 145,918 text-to-PDDL pairs across 73 unique state combinations with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models that reveal this task's complexity. For example, 96.1% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable, 94.4% are solvable, but only 24.8% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.
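The gap between "solvable" and "semantically correct" in the abstract can be illustrated with a toy check. The sketch below is not Planetarium's equivalence algorithm, which this page does not describe. It assumes a hypothetical simplified representation in which a problem is a tuple of objects, initial-state atoms, and goal atoms, and it treats two problems as equivalent when some one-to-one renaming of objects makes both the initial state and the goal match exactly. The function names and the blocks-world atoms are illustrative assumptions.

```python
from itertools import permutations

# ASSUMPTION: toy representation, not the benchmark's data format.
# A problem is (objects, init, goal); init and goal are sets of ground
# atoms such as ("on", "a", "b").

def rename(atoms, mapping):
    """Apply an object renaming to every argument of every atom."""
    return frozenset((atom[0],) + tuple(mapping[o] for o in atom[1:]) for atom in atoms)

def equivalent(p, q):
    """Brute-force check: does some object bijection map p onto q exactly?"""
    objs_p, init_p, goal_p = p
    objs_q, init_q, goal_q = q
    if len(objs_p) != len(objs_q):
        return False
    ordered_p = sorted(objs_p)
    for perm in permutations(sorted(objs_q)):
        mapping = dict(zip(ordered_p, perm))
        if (rename(init_p, mapping) == frozenset(init_q)
                and rename(goal_p, mapping) == frozenset(goal_q)):
            return True
    return False

# Reference problem: two blocks on the table, goal is to stack a on b.
reference = (
    {"a", "b"},
    {("ontable", "a"), ("ontable", "b"), ("clear", "a"), ("clear", "b")},
    {("on", "a", "b")},
)

# Same problem with different object names: should count as correct.
renamed = (
    {"x", "y"},
    {("ontable", "x"), ("ontable", "y"), ("clear", "x"), ("clear", "y")},
    {("on", "y", "x")},
)

# Well-formed and trivially solvable, but the goal is a different task.
wrong_goal = (
    {"x", "y"},
    {("ontable", "x"), ("ontable", "y"), ("clear", "x"), ("clear", "y")},
    {("clear", "x")},
)

print(equivalent(reference, renamed))
print(equivalent(reference, wrong_goal))
```

In this toy setting the third problem would pass a parser and a planner yet still fail the comparison, which is the kind of discrepancy the 94.4% versus 24.8% figures point to. The brute-force search over permutations is only practical for a handful of objects.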
