CLFeb 17, 2025

AURORA:Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification

Xiaoyu Tan, Tianchu Yao, Chao Qu, Bin Li, Minghao Yang, Dakuan Lu, Haozhe Wang, Xihe Qiu, Wei Chu, Yinghui Xu, Yuan Qi

arXiv:2502.11520v121.320 citationsh-index: 9Has Code

Originality Incremental advance

AI Analysis

This work addresses the problem of automated process evaluation for AI researchers and developers, offering an incremental improvement through a novel framework and benchmark.

The paper tackles the challenge of evaluating and optimizing complex reasoning processes in large language models by introducing AURORA, an automated framework for training universal process reward models using ensemble prompting and reverse verification, which enhances process evaluation accuracy and improves reward model performance for diverse policy distributions and long chain-of-thought responses.

The reasoning capabilities of advanced large language models (LLMs) like o1 have revolutionized artificial intelligence applications. Nevertheless, evaluating and optimizing complex reasoning processes remain significant challenges due to diverse policy distributions and the inherent limitations of human effort and accuracy. In this paper, we present AURORA, a novel automated framework for training universal process reward models (PRMs) using ensemble prompting and reverse verification. The framework employs a two-phase approach: First, it uses diverse prompting strategies and ensemble methods to perform automated annotation and evaluation of processes, ensuring robust assessments for reward learning. Second, it leverages practical reference answers for reverse verification, enhancing the model's ability to validate outputs and improving training accuracy. To assess the framework's performance, we extend beyond the existing ProcessBench benchmark by introducing UniversalBench, which evaluates reward predictions across full trajectories under diverse policy distribtion with long Chain-of-Thought (CoT) outputs. Experimental results demonstrate that AURORA enhances process evaluation accuracy, improves PRMs' accuracy for diverse policy distributions and long-CoT responses. The project will be open-sourced at https://auroraprm.github.io/. The Universal-PRM-7B is available at https://huggingface.co/infly/Universal-PRM-7B.

View on arXiv PDF

Similar