LG | CL | ML | Dec 23, 2025

Learning to Reason in LLMs by Expectation Maximization

arXiv:2512.20169v1
h-index: 37
Originality: Incremental advance
AI Analysis

This work aims to improve the reasoning capabilities of LLMs on tasks such as question answering. It appears incremental because it builds on existing methods such as STaR.

The paper formalizes reasoning in large language models as a latent variable model and derives an expectation-maximization objective from it. It shows that the sampling scheme used to generate rationales significantly affects accuracy, with prompt posterior sampling outperforming the other schemes on ARC, MMLU, and OpenBookQA.

Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive an expectation-maximization (EM) objective for learning to reason. This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution that generates rationales that justify correct answers. We instantiate and compare several sampling schemes: rejection sampling with a budget, self-taught reasoner (STaR), and prompt posterior sampling (PPS), which only keeps the rationalization stage of STaR. Our experiments on the ARC, MMLU, and OpenBookQA datasets with the Llama and Qwen models show that the sampling scheme can significantly affect the accuracy of learned reasoning models. Despite its simplicity, we observe that PPS outperforms the other sampling schemes.
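The three sampling schemes named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Model` interface, the hint wording, `make_toy_model`, and the choice to keep a hinted rationale only when the resulting answer is correct are all hypothetical. The STaR sketch follows the commonly described two-stage form (try without a hint, then rationalize with the correct answer in the prompt), and PPS is shown as that second stage alone.

```python
import random
from typing import Callable, Optional, Tuple

# Hypothetical interface: a model maps a prompt to one sampled
# (rationale, predicted_answer) pair.
Model = Callable[[str], Tuple[str, str]]


def rejection_sample(model: Model, question: str, answer: str,
                     budget: int) -> Optional[str]:
    """Draw up to `budget` rationales; keep the first that yields `answer`."""
    for _ in range(budget):
        rationale, predicted = model(question)
        if predicted == answer:
            return rationale
    return None  # budget exhausted without a correct answer


def prompt_posterior_sample(model: Model, question: str,
                            answer: str) -> Optional[str]:
    """Sample a rationale with the correct answer given in the prompt.

    Assumption: the hint format and the correctness filter are
    illustrative choices, not taken from the paper.
    """
    hinted = f"{question}\nHint: the correct answer is {answer}."
    rationale, predicted = model(hinted)
    return rationale if predicted == answer else None


def star_sample(model: Model, question: str, answer: str) -> Optional[str]:
    """STaR-style: try without a hint first, then fall back to a hint."""
    rationale, predicted = model(question)
    if predicted == answer:
        return rationale
    return prompt_posterior_sample(model, question, answer)


def make_toy_model(p_correct: float, p_correct_with_hint: float,
                   truth: str, seed: int = 0) -> Model:
    """Hypothetical stand-in for an LLM with fixed success probabilities."""
    rng = random.Random(seed)

    def model(prompt: str) -> Tuple[str, str]:
        p = p_correct_with_hint if "Hint:" in prompt else p_correct
        predicted = truth if rng.random() < p else "wrong"
        return f"reasoning leading to {predicted}", predicted

    return model


if __name__ == "__main__":
    toy = make_toy_model(p_correct=0.2, p_correct_with_hint=0.9, truth="B")
    question = "Which option is right? (A) ... (B) ..."
    print("rejection:", rejection_sample(toy, question, "B", budget=4))
    print("STaR:     ", star_sample(toy, question, "B"))
    print("PPS:      ", prompt_posterior_sample(toy, question, "B"))
```

In a training loop, the rationales returned by whichever scheme is chosen would form the fine-tuning set for the next round. The sketch only makes the structural difference visible: rejection sampling never sees the answer, STaR sees it only after a failed attempt, and PPS always conditions on it.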

