CL · Jun 4, 2025

Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation

arXiv:2506.03887v1 · 10 citations · h-index: 8 · Has Code · ACL
Originality: Incremental advance
AI Analysis

The paper speeds up structured output generation for LLM applications, but the advance is incremental because it builds on existing parsing methods.

The paper tackles the inefficiency of structured LLM generation for LR(1) grammars by proposing Pre$^3$, which uses deterministic pushdown automata to optimize decoding, resulting in up to 40% reduction in time per output token and 36% increase in throughput.

A wide range of LLM applications demand efficient structured generation, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON). Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), which incurs runtime execution overhead for context-dependent token processing and is especially inefficient under large inference batches. To address these issues, we propose Pre$^3$, which exploits deterministic pushdown automata (DPDA) to optimize constrained LLM decoding efficiency. First, by precomputing prefix-conditioned edges during preprocessing, Pre$^3$ enables ahead-of-time edge analysis and thus makes parallel transition processing possible. Second, by leveraging the prefix-conditioned edges, Pre$^3$ introduces a novel approach that transforms LR(1) transition graphs into DPDA, eliminating the need for runtime path exploration and achieving edge transitions with minimal overhead. Pre$^3$ can be seamlessly integrated into standard LLM inference frameworks, reducing time per output token (TPOT) by up to 40% and increasing throughput by up to 36% in our experiments. Our code is available at https://github.com/ModelTC/lightllm.
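To illustrate why determinism matters for constrained decoding, the following is a minimal sketch of a toy deterministic pushdown automaton for nested brackets. It is not the Pre$^3$ algorithm and does not use any lightllm API; the names `TRANSITIONS`, `ALLOWED`, `allowed_symbols`, `step`, and `accepts` are hypothetical and chosen only for this example. It shows the general idea that, when each (stack top, symbol) pair has at most one transition, the set of legal next symbols can be tabulated ahead of time and each decoding step reduces to a lookup with no path exploration.

```python
from typing import Dict, List, Optional, Set, Tuple

BOTTOM = "$"  # stack-bottom marker (an assumption of this toy example)

# (stack_top, input_symbol) -> (action, symbol_to_push).
# Each key appears at most once, so the automaton is deterministic:
# no backtracking or runtime path exploration is needed.
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, Optional[str]]] = {
    (BOTTOM, "["): ("push", "["),
    (BOTTOM, "{"): ("push", "{"),
    ("[", "["): ("push", "["),
    ("[", "{"): ("push", "{"),
    ("[", "]"): ("pop", None),
    ("{", "["): ("push", "["),
    ("{", "{"): ("push", "{"),
    ("{", "}"): ("pop", None),
}

# Precomputed once, before any decoding: legal symbols for each stack top.
ALLOWED: Dict[str, Set[str]] = {}
for (top, sym) in TRANSITIONS:
    ALLOWED.setdefault(top, set()).add(sym)


def allowed_symbols(stack: List[str]) -> Set[str]:
    """Return the symbols that may legally follow, via a single table lookup."""
    return ALLOWED.get(stack[-1], set())


def step(stack: List[str], symbol: str) -> List[str]:
    """Apply the unique transition for (stack top, symbol); return the new stack."""
    action = TRANSITIONS.get((stack[-1], symbol))
    if action is None:
        raise ValueError(f"symbol {symbol!r} not allowed on top of {stack[-1]!r}")
    kind, pushed = action
    if kind == "push":
        return stack + [pushed]
    return stack[:-1]  # "pop"


def accepts(text: str) -> bool:
    """Check whether text is a sequence of properly nested [] and {} groups."""
    stack = [BOTTOM]
    for ch in text:
        if ch not in allowed_symbols(stack):
            return False
        stack = step(stack, ch)
    return stack == [BOTTOM]
```

In a real decoder the result of `allowed_symbols` would be turned into a mask over the model's vocabulary; how Pre$^3$ builds its prefix-conditioned edges from LR(1) transition graphs is described in the paper and is not reproduced here.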

Code Implementations: 1 repo