Mar 16, 2025

Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

arXiv:2503.13551v4 | 14 citations | h-index: 4 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses reliability and cost issues in reward modeling for LLM reasoning, offering an incremental improvement with potential benefits for AI researchers and developers focused on enhancing reasoning tasks.

The paper tackles the problem of reward hacking and high annotation costs in Process Reward Models (PRMs) for LLM reasoning by proposing a Hierarchical Reward Model (HRM) that evaluates reasoning steps at multiple levels and a data augmentation strategy called Hierarchical Node Compression (HNC). Results show HRM with HNC provides more stable evaluations than PRM on the PRM800K dataset and demonstrates strong generalization on MATH500 and GSM8K datasets.
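As a rough illustration of evaluating steps "at multiple levels", the sketch below builds two kinds of scoring units from a linear reasoning trajectory: each single step (fine-grained) and each pair of consecutive steps joined together (coarse-grained). The function name `build_units`, the newline separator, and the restriction to pairs are assumptions made for this example, not details taken from the paper.

```python
from typing import List


def build_units(steps: List[str]) -> List[str]:
    """Return fine-grained units (single steps) followed by
    coarse-grained units (two consecutive steps joined by a newline).

    Hypothetical helper: the pairing scheme is an assumption.
    """
    fine = list(steps)
    coarse = [steps[i] + "\n" + steps[i + 1] for i in range(len(steps) - 1)]
    return fine + coarse


steps = ["x + 2 = 5", "x = 5 - 2", "x = 3"]
units = build_units(steps)
for unit in units:
    print(repr(unit))
```

A reward model trained on both kinds of units sees a flawed step alone and also alongside the step that follows it, which is one plausible way to credit a later correction.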

Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate step. In addition, the cost of annotating reasoning processes for reward modeling is high, making large-scale collection of high-quality data challenging. To address this, we propose a novel reward model approach called the Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at both fine-grained and coarse-grained levels. HRM excels at assessing multi-step reasoning coherence, especially when flawed steps are later corrected through self-reflection. To further reduce the cost of generating training data, we introduce a lightweight and effective data augmentation strategy called Hierarchical Node Compression (HNC), which merges two consecutive reasoning steps into one within the tree structure. By applying HNC to MCTS-generated reasoning trajectories, we enhance the diversity and robustness of HRM training data while introducing controlled noise with minimal computational overhead. Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM. Furthermore, cross-domain evaluations on the MATH500 and GSM8K datasets demonstrate HRM's strong generalization and robustness across a variety of reasoning tasks.
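The merge operation behind HNC can be sketched on a toy reasoning tree. This is a minimal illustration under stated assumptions: the `Node` class, the `compress` and `paths` helpers, and the choice to concatenate step text with a newline are all hypothetical, and the abstract does not specify how nodes are selected or how labels are assigned to merged nodes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One reasoning step in a search tree (hypothetical structure)."""
    text: str
    children: List["Node"] = field(default_factory=list)


def compress(parent: Node, child_index: int) -> Node:
    """Merge `parent` with one of its children into a single new node.

    The merged node carries both steps' text and inherits the child's
    children. The original tree is left unmodified.
    """
    child = parent.children[child_index]
    return Node(text=parent.text + "\n" + child.text,
                children=list(child.children))


def paths(node: Node, prefix: Tuple[str, ...] = ()) -> List[List[str]]:
    """Enumerate every root-to-leaf trajectory as a list of step texts."""
    prefix = prefix + (node.text,)
    if not node.children:
        return [list(prefix)]
    out: List[List[str]] = []
    for c in node.children:
        out.extend(paths(c, prefix))
    return out


# Toy tree: s1 -> s2 -> s3, plus an alternative branch s1 -> s2_alt.
root = Node("s1", [Node("s2", [Node("s3")]), Node("s2_alt")])
merged = compress(root, 0)
original_paths = paths(root)
merged_paths = paths(merged)
```

Here the original tree yields the trajectories `s1, s2, s3` and `s1, s2_alt`, while the merged node yields a single shorter trajectory whose first step spans two original steps. Generating such variants needs no further model calls, which is consistent with the abstract's claim of minimal computational overhead.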

Code Implementations: 1 repo
