XGrammar 2: Dynamic and Efficient Structured Generation Engine for Agentic LLMs
XGrammar 2 addresses inefficient structured generation for LLM agents, offering a significant performance improvement for developers and researchers in AI and NLP. The contribution appears incremental, since it builds on the earlier XGrammar engine.
The paper tackles dynamic structured generation tasks for LLM agents, such as tool calling and conditional generation, by proposing XGrammar 2, an optimized structured generation engine. It reports more than 6x speedup over existing structured generation engines and, when integrated with an LLM inference engine, near-zero overhead.
Modern LLM agents are required to handle increasingly complex structured generation tasks, such as tool calling and conditional structured generation. These tasks are significantly more dynamic than predefined structures, posing new challenges to current structured generation engines. In this paper, we propose XGrammar 2, a highly optimized structured generation engine for agentic LLMs. XGrammar 2 accelerates mask generation for these dynamic structured generation tasks through a new dynamic dispatching semantics, TagDispatch. We further introduce a just-in-time (JIT) compilation method to reduce compilation time and a cross-grammar caching mechanism to exploit the common sub-structures shared across different grammars. Additionally, we extend the previous PDA-based mask generation algorithm to an Earley-parser-based one and design a repetition compression algorithm to handle repetition structures in grammars. Evaluation results show that XGrammar 2 achieves more than 6x speedup over existing structured generation engines. Integrated with an LLM inference engine, XGrammar 2 can handle dynamic structured generation tasks with near-zero overhead.
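
To make the dispatching idea concrete, here is a minimal Python sketch. It is not XGrammar 2's implementation or API. It assumes that tag-based dispatch means roughly the following: output is unconstrained until a trigger tag appears, after which only tokens consistent with the structure bound to that tag are allowed. The class `ToyTagDispatcher`, the `<tool>` tag, the toy vocabulary, and the finite set of allowed bodies are all hypothetical stand-ins; a real engine would dispatch to a full grammar and compute masks over a tokenizer vocabulary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTagDispatcher:
    """Free text until a trigger tag appears, then a constrained body."""

    # Hypothetical mapping: trigger tag -> finite set of allowed bodies.
    # Assumption: no body is a strict prefix of another body for the same tag.
    dispatch: dict[str, set[str]]
    text: str = ""                            # everything accepted so far
    candidates: Optional[set[str]] = None     # None means free-text mode
    consumed: str = ""                        # constrained body seen so far

    def allowed(self, token: str) -> bool:
        """Return True if `token` may be generated next."""
        if self.candidates is None:
            return True  # free-text mode: nothing is masked
        prefix = self.consumed + token
        return any(body.startswith(prefix) for body in self.candidates)

    def mask(self, vocab: list[str]) -> list[bool]:
        """One boolean per vocabulary entry: True means the token is allowed."""
        return [self.allowed(tok) for tok in vocab]

    def accept(self, token: str) -> None:
        """Advance the state with a token the model has emitted."""
        if not self.allowed(token):
            raise ValueError(f"token {token!r} violates the active structure")
        self.text += token
        if self.candidates is None:
            # Dispatch: switch to the structure bound to a tag that just ended.
            for tag, bodies in self.dispatch.items():
                if self.text.endswith(tag):
                    self.candidates, self.consumed = bodies, ""
                    break
        else:
            self.consumed += token
            if self.consumed in self.candidates:
                # Body complete: return to free-text mode.
                self.candidates, self.consumed = None, ""


def allowed_tokens(d: ToyTagDispatcher, vocab: list[str]) -> list[str]:
    """Helper: list the vocabulary entries the current mask lets through."""
    return [tok for tok, ok in zip(vocab, d.mask(vocab)) if ok]


vocab = ["Hello", " ", "<tool>", "get_time()", "get_", "time()", "rm()", "</tool>"]
d = ToyTagDispatcher({"<tool>": {"get_time()</tool>", "get_date()</tool>"}})

d.accept("Hello")
free_mask = d.mask(vocab)               # free text: every token allowed
d.accept("<tool>")                      # trigger tag fires the dispatch
tool_tokens = allowed_tokens(d, vocab)  # only prefixes of an allowed body
d.accept("get_")
after_get = allowed_tokens(d, vocab)
d.accept("time()")
d.accept("</tool>")                     # body complete, back to free text
```

In free-text mode the mask is trivial; it only depends on the bound structure after a tag fires. That split is the intuition this sketch is meant to convey, under the assumptions stated above.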