Probabilistic Programs of Thought

Poorva Garg, Renato Lui Geh, Daniel Israel, Todd Millstein, Kyle Richardson, Guy Van den Broeck

arXiv:2604.17290

AI Analysis

This work addresses the computational bottleneck of sampling many programs from LLMs for code generation and reasoning tasks, offering a more efficient test-time method.

LLMs are used for code generation and mathematical reasoning, but generating many samples is computationally expensive. The authors propose probabilistic programs of thought, a test-time framework that uses next-token probabilities to compactly represent exponentially many programs, enabling cheaper sampling without additional GPU compute. They report performance improvements on code generation, code understanding, and mathematical reasoning benchmarks with fewer LLM generations.

LLMs are widely used for code generation and mathematical reasoning tasks that require structured output. They must reason about code, generate code for a given specification, or reason using programs of thought. The typical approach to code generation is to prompt the model and generate samples until an appropriate program is obtained. In this process, sampling $n$ programs from the language model requires $n$ compute-intensive GPU generations, which becomes prohibitively expensive for larger values of $n$. In this work, we address this limitation by exposing the LLM's distribution within the generated programs themselves. We propose a novel test-time framework, which we dub probabilistic programs of thought, to obtain more samples from the model with fewer LLM generations. Given a program generated by a model and the associated next-token probabilities, we build a probabilistic program that compactly represents exponentially many deterministic programs. Since probabilistic reasoning in this probabilistic program is much cheaper than an LLM generation, our approach allows sampling new programs without any additional GPU compute and with little CPU overhead. We instantiate our approach on benchmarks for code generation, code understanding, and mathematical reasoning, and report improvements in performance with fewer generations from the LLM.
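The idea of turning one generation plus its next-token probabilities into a compact distribution over many programs can be illustrated with a toy sketch. This is not the authors' implementation: the `slots` data, the helper names `count_programs` and `sample_program`, and the choice to treat each token position as an independent categorical choice are all assumptions made for illustration. The abstract does not specify how the probabilistic program is constructed.

```python
import math
import random

# Hypothetical data: for each token position of ONE generated program,
# the candidate tokens and their next-token probabilities.
# Positions with a single candidate are deterministic.
slots = [
    [("return", 1.0)],
    [(" x", 0.7), (" y", 0.3)],
    [(" +", 0.6), (" -", 0.4)],
    [(" 1", 0.5), (" 2", 0.5)],
]

def count_programs(slots):
    """Number of distinct deterministic programs the slots represent."""
    return math.prod(len(alts) for alts in slots)

def sample_program(slots, rng):
    """Draw one program on the CPU by sampling each position independently.

    Independence across positions is a simplifying assumption of this
    sketch, not a claim about the paper's construction.
    """
    tokens = []
    for alts in slots:
        candidates = [tok for tok, _ in alts]
        weights = [prob for _, prob in alts]
        tokens.append(rng.choices(candidates, weights=weights, k=1)[0])
    return "".join(tokens)

rng = random.Random(0)
samples = [sample_program(slots, rng) for _ in range(5)]
print(count_programs(slots))
for program in samples:
    print(program)
```

Three positions with two candidates each already encode 2 × 2 × 2 = 8 programs, which shows why the number of represented programs grows exponentially with the number of uncertain positions while each new sample costs only a few CPU operations.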
