Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning
This work addresses the need for better evaluation metrics for synthetic verifiers in AI, particularly for code and reasoning tasks. Its contribution is incremental, since it builds on existing benchmarks and methods.
The paper addresses how to evaluate synthetic verification methods for code and reasoning in large language models. It transforms existing coding benchmarks into scoring and ranking datasets and releases four new benchmarks. Its main findings are that reasoning improves test case generation and that scaling the number of test cases improves verification accuracy.
Synthetic verification techniques such as generating test cases and reward modelling are common ways to enhance the coding capabilities of large language models (LLMs) beyond predefined tests. Additionally, code verification has recently found great success as a critical component in improving the reasoning capability of LLMs via reinforcement learning. In this paper, we propose an approach that transforms existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We also propose multiple metrics that measure different aspects of synthetic verifiers on the proposed benchmarks. Using this approach, we release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases improves verification accuracy.
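The idea of scoring and ranking candidate solutions can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not define the metrics, so the helper names (`pass_fraction`, `top1_accuracy`, `pairwise_agreement`), the toy task, and the two metrics shown are assumptions chosen for illustration. Each candidate receives a gold score (the fraction of ground-truth tests it passes), and a synthetic verifier is then judged by how well the scores from its generated tests reproduce that ordering.

```python
from typing import Callable, List, Sequence, Tuple

Test = Tuple[tuple, object]  # (positional args, expected output)


def pass_fraction(solution: Callable, tests: Sequence[Test]) -> float:
    """Fraction of tests the solution passes; a crash counts as a failure."""
    passed = 0
    for args, expected in tests:
        try:
            if solution(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(tests)


def top1_accuracy(gold: List[float], predicted: List[float]) -> bool:
    """True if the verifier's top-scored candidate has the best gold score."""
    best = max(range(len(predicted)), key=lambda i: predicted[i])
    return gold[best] == max(gold)


def pairwise_agreement(gold: List[float], predicted: List[float]) -> float:
    """Share of candidate pairs (with distinct gold scores) ordered correctly."""
    agree = total = 0
    for i in range(len(gold)):
        for j in range(i + 1, len(gold)):
            if gold[i] == gold[j]:
                continue
            total += 1
            if (gold[i] - gold[j]) * (predicted[i] - predicted[j]) > 0:
                agree += 1
    return agree / total if total else 1.0


# Toy task (hypothetical): absolute value, with candidates of varying quality.
candidates = [
    lambda x: abs(x),  # correct
    lambda x: x,       # wrong for negative inputs
    lambda x: x + 1,   # wrong everywhere
]

# Ground-truth tests define the gold scores used for ranking.
gold_tests = [((3,), 3), ((-2,), 2), ((0,), 0), ((-7,), 7)]
gold_scores = [pass_fraction(c, gold_tests) for c in candidates]

# Two hypothetical sets of verifier-generated tests: the weak set never
# probes negative inputs, while the strong set does.
weak_tests = [((5,), 5), ((1,), 1)]
strong_tests = weak_tests + [((-1,), 1), ((-4,), 4)]

weak_scores = [pass_fraction(c, weak_tests) for c in candidates]
strong_scores = [pass_fraction(c, strong_tests) for c in candidates]

print(gold_scores)
print(top1_accuracy(gold_scores, weak_scores),
      pairwise_agreement(gold_scores, weak_scores))
print(top1_accuracy(gold_scores, strong_scores),
      pairwise_agreement(gold_scores, strong_scores))
```

In this toy setup the weak test set cannot separate the correct candidate from the partially correct one, so its pairwise agreement is lower than that of the larger test set. This mirrors, in miniature, the reported finding that more test cases improve verification accuracy.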