LG · Feb 10, 2025

VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data

arXiv:2502.06737v2 · 32 citations · h-index: 89 · ICML
Originality: Highly original
AI Analysis

This work addresses the limited domain generalizability of PRMs, which matters for users who need robust step-by-step reasoning in domains beyond mathematics.

The authors address the weakness of Process Reward Models (PRMs) in non-mathematical domains. In the MMLU-Pro Law category, VersaPRM with weighted majority voting achieves a 7.9% performance gain over the majority voting baseline, compared with a 1.3% gain for Qwen2.5-Math-PRM, a difference of 6.6 percentage points.

Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data, and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs perform poorly in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM with weighted majority voting achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code, and models for VersaPRM.
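
To make the comparison between majority voting and weighted majority voting concrete, here is a minimal sketch of how the two could differ. It is not taken from the VersaPRM code. The function names, the sample answers, and the step scores are hypothetical. Using `min` to collapse per-step PRM scores into one solution score is an assumption, since the abstract does not say which aggregation is used.

```python
from collections import defaultdict

def majority_vote(answers):
    """Return the most frequent final answer (ties go to the answer seen first)."""
    counts = defaultdict(int)
    for answer in answers:
        counts[answer] += 1
    return max(counts, key=counts.get)

def weighted_majority_vote(answers, step_scores, aggregate=min):
    """Return the answer with the largest total PRM-derived weight.

    step_scores[i] holds the per-step scores a PRM assigned to sampled
    solution i. `aggregate` collapses them into one weight per solution;
    min is an assumed choice here, not something the abstract specifies.
    """
    totals = defaultdict(float)
    for answer, scores in zip(answers, step_scores):
        totals[answer] += aggregate(scores)
    return max(totals, key=totals.get)

# Hypothetical example: three sampled solutions to one multiple-choice question.
answers = ["A", "A", "B"]
step_scores = [
    [0.9, 0.2],   # solution 1: answer A, weak second step
    [0.8, 0.3],   # solution 2: answer A, weak second step
    [0.9, 0.95],  # solution 3: answer B, every step scored highly
]

print(majority_vote(answers))                        # A (two votes to one)
print(weighted_majority_vote(answers, step_scores))  # B (0.9 vs 0.2 + 0.3)
```

In this toy case the two solutions that agree on "A" each contain a low-scoring step, so their combined weight (0.5) falls below the single well-scored "B" solution (0.9). This is the kind of situation in which a PRM-weighted vote can overturn a plain majority vote.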

