AI · CL · LG · May 23

Learning to Reason Efficiently with A* Post-Training

arXiv:2605.24597 · 44.3
Predicted impact: top 6% in AI · last 90 days · Originality: incremental advance
AI Analysis

For LLM reasoning tasks, this work improves both correctness and efficiency by integrating a classical search algorithm (A*) into post-training. The advance is incremental because it combines existing techniques.

The authors frame natural language inference as a search problem and use A* search to guide LLMs toward correct and efficient proofs. Llama-3.2 models (1B to 3B parameters) go from near-zero accuracy to outperforming the much larger DeepSeek-V3.2, and A*-informed training signals balance accuracy against efficiency.

Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem in which the final answer is the valid proof itself, so the reasoning procedure must make every intermediate inference correct. Specifically, we investigate whether LLMs can learn to generate correct and efficient proofs with guidance from A* search, an algorithm that, given an admissible heuristic, is guaranteed to find an optimal path to a goal. We explore two training techniques: supervised fine-tuning on execution traces from A*, and reinforcement learning with A*-informed process reward models. Empirically, we find that Llama-3.2 models in the 1B to 3B range benefit substantially from A* post-training, going from near-zero accuracy to outperforming DeepSeek-V3.2, a much larger model. Our analysis uncovers a trade-off: while simple correctness rewards maximize accuracy, A*-informed signals strike a balance between accuracy and efficiency. Furthermore, we find that on larger search spaces, models trained with imperfect heuristics exhibit superior accuracy. Our results demonstrate a promising direction towards reasoning guided by principles derived from classical search algorithms.
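To make the "proof as search" framing concrete, here is a minimal sketch under stated assumptions; it is not the authors' encoding, heuristic, or training pipeline. It runs A* over a hypothetical toy rule base: a state is the set of facts derived so far, each step applies one rule at cost 1, and the assumed heuristic `h_max` is a relaxed-cost estimate that never overestimates the remaining steps, so the first goal state popped yields a shortest proof. The returned `trace` (states in expansion order) only hints at what an A* "execution trace" could look like. All names (`Rule`, `h_max`, `astar_proof`) are hypothetical.

```python
import heapq
from itertools import count
from typing import Optional

# Hypothetical toy encoding: a rule is (premise atoms, conclusion atom).
Rule = tuple[frozenset[str], str]
INF = float("inf")


def h_max(state: frozenset[str], goal: str, rules: list[Rule]) -> float:
    """Relaxed cost of deriving `goal` from `state` (an assumed heuristic).

    cost(atom) = 0 if already derived, else the minimum over rules concluding
    it of 1 + the largest cost among its premises. Any real proof must pay at
    least this much, so the estimate never overestimates (it is admissible).
    """
    cost = {atom: 0 for atom in state}
    changed = True
    while changed:  # iterate to a fixed point; costs only ever decrease
        changed = False
        for premises, concl in rules:
            c = 1 + max((cost.get(p, INF) for p in premises), default=0)
            if c < cost.get(concl, INF):
                cost[concl] = c
                changed = True
    return cost.get(goal, INF)


def astar_proof(
    facts: set[str], goal: str, rules: list[Rule]
) -> tuple[Optional[list[Rule]], list[frozenset[str]]]:
    """A* over sets of derived facts; each rule application costs 1.

    Returns (proof, trace): a shortest list of rule applications reaching
    `goal` (or None if unreachable) and the states in expansion order.
    """
    start = frozenset(facts)
    tie = count()  # tie-breaker so the heap never compares states
    frontier = [(h_max(start, goal, rules), next(tie), 0, start, [])]
    best_g = {start: 0}
    trace: list[frozenset[str]] = []
    while frontier:
        _, _, g, state, proof = heapq.heappop(frontier)
        if g > best_g[state]:
            continue  # stale entry superseded by a cheaper path
        trace.append(state)
        if goal in state:
            return proof, trace
        for rule in rules:
            premises, concl = rule
            if concl in state or not premises <= state:
                continue  # rule not applicable or adds nothing new
            nxt, ng = state | {concl}, g + 1
            if ng < best_g.get(nxt, INF):
                h = h_max(nxt, goal, rules)
                if h < INF:  # skip states from which the goal is unreachable
                    best_g[nxt] = ng
                    heapq.heappush(frontier, (ng + h, next(tie), ng, nxt, proof + [rule]))
    return None, trace


# Toy problem: derive G from A and B; the E/F branch is a distractor.
rules: list[Rule] = [
    (frozenset({"A"}), "C"),
    (frozenset({"B"}), "D"),
    (frozenset({"C", "D"}), "G"),
    (frozenset({"A"}), "E"),
    (frozenset({"E"}), "F"),
]
proof, trace = astar_proof({"A", "B"}, "G", rules)
print([concl for _, concl in proof])  # ['C', 'D', 'G']
```

The returned proof skips the distractor rules, which is the "correct and efficient" property the abstract targets. Swapping in a heuristic that can overestimate may speed up search on some inputs but removes the guarantee that the proof is shortest.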
