Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey
This survey addresses the problem of improving LLM reasoning for researchers and practitioners by synthesizing existing knowledge; as a survey, its contribution is incremental and it does not present new methods.
This paper provides a systematic survey of reward models (RMs) and their applications in enhancing large language model (LLM) reasoning. It covers fundamental concepts and key uses such as guiding generation and facilitating self-improvement, and it discusses open questions for future research.
Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals to finetune LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates during inference. In this paper, we provide a systematic introduction to RMs, along with a comprehensive survey of their applications in LLM reasoning. We first review fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. Then, we explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based finetuning. Finally, we discuss critical open questions regarding the selection, generalization, evaluation, and enhancement of RMs, based on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.
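As an illustration of the inference-time use mentioned above (selecting the best answer from multiple candidates), the following is a minimal best-of-N sketch. It is not taken from the paper: `best_of_n` and `toy_reward` are hypothetical names, and `toy_reward` is a hand-written stand-in for a trained RM that would, in practice, score a (prompt, response) pair with a learned model.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate response with the highest reward score.

    reward_fn is assumed to map (prompt, response) to a scalar score,
    which is how an outcome-level reward model is typically queried.
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    # Score every candidate and keep the highest-scoring one.
    return max(candidates, key=lambda response: reward_fn(prompt, response))


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a trained RM (assumption for this sketch).

    It assigns 1.0 to responses that end with the correct answer "4"
    and 0.0 otherwise; a real RM would be a learned scoring model.
    """
    return 1.0 if response.strip().endswith("4") else 0.0


candidates = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
best = best_of_n("What is 2 + 2?", candidates, toy_reward)
print(best)
```

In this sketch the policy LLM is left out entirely: the candidates are fixed strings, so only the selection step is shown. Swapping `toy_reward` for a call to an actual RM would leave `best_of_n` unchanged.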