Learning to Rank Chain-of-Thought: Using a Small Model
This work addresses the need for efficient and reliable reasoning verification in LLMs and offers a practical tool for real-world applications, though the contribution is incremental because it builds on existing verification methods.
The paper tackles the problem of unreliable mathematical reasoning in large language models by introducing EORM, a lightweight post-hoc verifier that ranks Chain-of-Thought solutions, boosting Llama 3 8B accuracy to 90.7% on GSM8k and 63.7% on MATH.
Large Language Models (LLMs) struggle with reliable mathematical reasoning, and current verification methods are often computationally expensive. This paper introduces the Energy Outcome Reward Model (EORM), a highly efficient, lightweight post-hoc verifier designed to address this challenge. EORM uses an energy-based framework to rank Chain-of-Thought (CoT) solutions, learning to distinguish correct from incorrect reasoning using only simple outcome labels, thus eliminating the need for expensive annotations. With only 55M parameters, over 127 times smaller than typical reward models, EORM boosts the accuracy of Llama 3 8B to 90.7% on GSM8k and 63.7% on MATH. This performance is achieved by efficiently selecting the optimal reasoning path from a pool of candidates, allowing it to match or exceed the accuracy of far more resource-intensive Best-of-N sampling techniques. Crucially, our experiments show that EORM generalizes effectively to out-of-distribution problems and unseen models, indicating it learns fundamental principles of valid reasoning. This robustness, combined with its efficiency, establishes EORM as a practical tool for deploying more dependable LLMs in complex, real-world applications.
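To make the selection step concrete, here is a minimal sketch of post-hoc reranking over a pool of candidate CoT solutions. It is not the paper's implementation. It assumes the common energy-based convention that a lower energy marks a more plausible solution, which the abstract does not state explicitly. The names `rank_by_energy`, `select_best`, `toy_energy`, and the numbers in `TOY_ENERGIES` are hypothetical stand-ins for a trained verifier and its scores.

```python
from typing import Callable, List, Sequence, Tuple


def rank_by_energy(
    candidates: Sequence[str],
    energy_fn: Callable[[str], float],
) -> List[Tuple[float, str]]:
    """Return (energy, candidate) pairs sorted from lowest to highest energy."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    scored = [(energy_fn(c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0])


def select_best(
    candidates: Sequence[str],
    energy_fn: Callable[[str], float],
) -> str:
    """Pick the candidate with the lowest energy (assumed convention)."""
    return rank_by_energy(candidates, energy_fn)[0][1]


# Hypothetical energies for three sampled solutions to "What is 3 + 4?".
# A real verifier would compute these from the solution text.
TOY_ENERGIES = {
    "3 + 4 = 7, so the answer is 7": -1.2,
    "3 + 4 = 8, so the answer is 8": 0.9,
    "3 * 4 = 12, so the answer is 12": 0.4,
}


def toy_energy(solution: str) -> float:
    """Stand-in for a learned energy model: a plain table lookup."""
    return TOY_ENERGIES[solution]


candidates = list(TOY_ENERGIES)
ranking = rank_by_energy(candidates, toy_energy)
best = select_best(candidates, toy_energy)
print(best)
```

In this sketch the verifier scores each candidate once, so the cost of reranking grows linearly with the size of the candidate pool, while the generator that produced the candidates is left untouched.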