CL · May 31

Robust Asynchronous Planning via Auto-Formalization

arXiv:2606.00981 · 79.6
Predicted impact: top 71% in CL · last 90 days
Originality: Incremental advance
AI Analysis

For AI planning researchers, this work shows that the choice of formal representation (CP-SAT rather than PDDL2.1 or direct generation) is critical for scaling LLM-based planning to realistic asynchronous tasks.

LLMs struggle with asynchronous planning tasks involving non-uniform durations, concurrency, and execution-time constraints. By formalizing tasks as constraint satisfaction programs (CP-SAT), the approach achieves 83% plan accuracy at 100 actions, compared to 5% for direct planning and 0% for PDDL2.1 formalization, and recovers to 84.5% under execution-time updates via state-aware repair.
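To make "plan accuracy" on an asynchronous task concrete, the following is a minimal sketch of checking a plan against duration and precedence constraints. The toy task, the action names, and the helpers `violations` and `makespan` are illustrative assumptions, not the paper's benchmark or evaluation code.

```python
from typing import Dict, List, Tuple

# Hypothetical toy task (not from the paper): action -> duration in minutes.
DURATIONS: Dict[str, int] = {"boil_water": 10, "chop_veg": 5, "cook_soup": 20}
# (before, after): `after` may start only once `before` has finished.
PRECEDENCE: List[Tuple[str, str]] = [
    ("boil_water", "cook_soup"),
    ("chop_veg", "cook_soup"),
]


def violations(start: Dict[str, int]) -> List[str]:
    """Return human-readable constraint violations for a plan of start times."""
    problems = []
    for action in DURATIONS:
        if action not in start:
            problems.append(f"{action} is not scheduled")
        elif start[action] < 0:
            problems.append(f"{action} starts before time 0")
    for before, after in PRECEDENCE:
        if before in start and after in start:
            # Non-uniform durations: `before` ends at start + its own duration.
            if start[before] + DURATIONS[before] > start[after]:
                problems.append(f"{after} starts before {before} finishes")
    return problems


def makespan(start: Dict[str, int]) -> int:
    """Time at which the last scheduled action finishes."""
    return max(start[a] + DURATIONS[a] for a in start)


# Concurrency: boiling and chopping overlap, so the plan finishes sooner.
concurrent_plan = {"boil_water": 0, "chop_veg": 0, "cook_soup": 10}
sequential_plan = {"boil_water": 0, "chop_veg": 10, "cook_soup": 15}
broken_plan = {"boil_water": 0, "chop_veg": 0, "cook_soup": 5}
```

Both the concurrent and the sequential plan satisfy every constraint, but only the concurrent one exploits overlap; the broken plan starts cooking before the water has boiled. A constraint solver would search for start times that pass such checks while minimizing the makespan.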

LLMs can plan either by generating action sequences directly (as a Planner) or by translating tasks into a domain-specific language for an external solver (as a Formalizer). Although most real-world tasks are asynchronous, with non-uniform durations, concurrency, and execution-time constraints, existing benchmarks hardly cover them. We unify these asynchronous planning challenges under a single formulation and introduce the first three benchmarks that address each at scale. We conclude that the choice of formal representation primarily determines whether planning scales: as dependency graphs grow from 5 to 100 actions, Planner collapses from 96% to 5% plan accuracy and PDDL2.1 Formalizer from 13% to 0%, while CP-SAT Formalizer averages 94% and still achieves 83% at 100 actions. Faithfulness diagnostics show that PDDL2.1's predicate-based planning representation is more brittle than general constraint satisfaction programs when LLMs must keep predicates, effects, and goals consistent. Execution-time updates to planning constraints degrade performance sharply (Planner 23.9%, PDDL2.1 0.7%, CP-SAT 46.1%), but a state-aware repair strategy that updates only event-induced constraints recovers CP-SAT Formalizer to 84.5%.
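The idea of repairing a plan after an execution-time update can be sketched as follows. This is a simplified stand-in, not the paper's method: it assumes only precedence constraints and unlimited concurrency, so a longest-path pass replaces a real CP-SAT solver. The task, the event format (action mapped to a new duration), and the functions `earliest_starts` and `repair` are all hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Optional, Tuple

# Hypothetical toy task (not from the paper): action -> duration in minutes.
TASK_DURATIONS: Dict[str, int] = {
    "boil_water": 10, "chop_veg": 5, "cook_soup": 20, "serve": 5,
}
TASK_PRECEDENCE: List[Tuple[str, str]] = [
    ("boil_water", "cook_soup"), ("chop_veg", "cook_soup"), ("cook_soup", "serve"),
]


def earliest_starts(
    durations: Dict[str, int],
    precedence: List[Tuple[str, str]],
    fixed: Optional[Dict[str, int]] = None,
    now: int = 0,
) -> Dict[str, int]:
    """Earliest start per action; `fixed` starts are kept, others are >= `now`."""
    fixed = fixed or {}
    preds = {a: set() for a in durations}
    for before, after in precedence:
        preds[after].add(before)
    start: Dict[str, int] = {}
    # static_order() yields every action after all of its predecessors.
    for action in TopologicalSorter(preds).static_order():
        if action in fixed:
            start[action] = fixed[action]
            continue
        ready = max((start[p] + durations[p] for p in preds[action]), default=0)
        start[action] = max(now, ready)
    return start


def repair(
    durations: Dict[str, int],
    precedence: List[Tuple[str, str]],
    old_start: Dict[str, int],
    now: int,
    event: Dict[str, int],
) -> Dict[str, int]:
    """Apply a duration-change event at time `now` and reschedule the rest."""
    # Only the constraints named by the event change; the rest are reused.
    new_durations = {**durations, **event}
    # Execution state: actions already started cannot be moved.
    started = {a: s for a, s in old_start.items() if s < now}
    return earliest_starts(new_durations, precedence, fixed=started, now=now)


original = earliest_starts(TASK_DURATIONS, TASK_PRECEDENCE)
# At minute 5, boiling turns out to take 15 minutes instead of 10.
repaired = repair(TASK_DURATIONS, TASK_PRECEDENCE, original, now=5,
                  event={"boil_water": 15})
```

In this sketch the repair keeps the original constraint set, pins the actions that have already started, and changes only the duration named by the event, rather than rebuilding the whole formalization from scratch.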
