LG CL ML
Dec 23, 2025

Learning to Reason in LLMs by Expectation Maximization

Junghyun Lee, Branislav Kveton, Sunav Choudhary, Subhojyoti Mukherjee, Anup Rao, Ryan A. Rossi, Alexa Siu

arXiv:2512.20169v1
4.1
h-index: 37

Originality: Incremental advance

AI Analysis

This work aims to improve the reasoning capabilities of LLMs on tasks such as question answering. It appears incremental because it builds on existing methods such as STaR.

The paper formalizes reasoning in large language models as a latent variable model and derives an expectation-maximization objective for learning to reason. It shows that the sampling scheme used to generate rationales significantly affects accuracy, and that prompt posterior sampling outperforms the other schemes on the ARC, MMLU, and OpenBookQA datasets.

Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive an expectation-maximization (EM) objective for learning to reason. This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution that generates rationales that justify correct answers. We instantiate and compare several sampling schemes: rejection sampling with a budget, self-taught reasoner (STaR), and prompt posterior sampling (PPS), which only keeps the rationalization stage of STaR. Our experiments on the ARC, MMLU, and OpenBookQA datasets with the Llama and Qwen models show that the sampling scheme can significantly affect the accuracy of learned reasoning models. Despite its simplicity, we observe that PPS outperforms the other sampling schemes.
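The three sampling schemes named in the abstract can be contrasted in a minimal Python sketch. Nothing here is taken from the paper's code: the `Generator` interface, `toy_model`, and all function names are hypothetical, and the decision to keep a hinted sample only when its answer matches the label is an assumption of this sketch, since the abstract does not say how such samples are filtered. STaR is sketched as it is commonly described (an unhinted attempt, followed by a rationalization attempt with the correct answer placed in the prompt), and PPS as the rationalization stage alone.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical model interface: (question, hint) -> (rationale, predicted answer).
# `hint` is the correct answer placed in the prompt, or None for no hint.
Generator = Callable[[str, Optional[str]], Tuple[str, str]]
Sampler = Callable[[Generator, str, str], Optional[str]]


def rejection_sample(gen: Generator, question: str, answer: str,
                     budget: int = 4) -> Optional[str]:
    """Draw up to `budget` unhinted rationales; keep the first that answers correctly."""
    for _ in range(budget):
        rationale, predicted = gen(question, None)
        if predicted == answer:
            return rationale
    return None  # budget exhausted, the example contributes nothing


def star_sample(gen: Generator, question: str, answer: str) -> Optional[str]:
    """STaR-style: try without a hint, then rationalize with the answer as a hint."""
    rationale, predicted = gen(question, None)
    if predicted == answer:
        return rationale
    rationale, predicted = gen(question, answer)  # rationalization stage
    return rationale if predicted == answer else None


def pps_sample(gen: Generator, question: str, answer: str) -> Optional[str]:
    """PPS-style: only the rationalization stage, always conditioning on the answer."""
    rationale, predicted = gen(question, answer)
    # Assumption of this sketch: discard the sample if the answer still mismatches.
    return rationale if predicted == answer else None


def collect(sampler: Sampler, gen: Generator,
            data: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Sampling step: gather (question, rationale, answer) triples to fine-tune on."""
    kept = []
    for question, answer in data:
        rationale = sampler(gen, question, answer)
        if rationale is not None:
            kept.append((question, rationale, answer))
    return kept


def toy_model(question: str, hint: Optional[str] = None) -> Tuple[str, str]:
    """Stand-in for an LLM that answers correctly only when given the hint."""
    predicted = hint if hint is not None else "unknown"
    return f"reasoning about {question!r}", predicted
```

With `toy_model`, `collect(rejection_sample, toy_model, data)` returns an empty list, while `collect(star_sample, ...)` and `collect(pps_sample, ...)` keep every example. This illustrates the abstract's point that the choice of sampling distribution determines which rationales are available for training. In a real pipeline the kept triples would be used to fine-tune the model before sampling again.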
