AI · Aug 18, 2025

EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing

arXiv:2508.13003v2 · 3 citations · h-index: 13
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of accurately evaluating SOTA LLMs on mathematical reasoning by creating evolvable benchmarks. The advance is incremental, since it applies existing evolutionary testing methods to a specific domain.

The paper tackles the problem of mathematical reasoning benchmarks becoming easier as LLMs learn from them, by introducing EvolMathEval, an automated framework that generates and evolves high-difficulty problems, reducing model accuracy by an average of 48% and identifying a 'Pseudo Aha Moment' phenomenon causing 77% to 100% of errors.

The rapid advancement of Large Language Models (LLMs) poses a significant challenge to existing mathematical reasoning benchmarks. These benchmarks tend to become easier over time because LLMs can learn from the published benchmarks. This limitation hinders the precise evaluation of the true capabilities of SOTA models. To address this challenge, this paper introduces EvolMathEval, an automated mathematical benchmark generation and evolution framework based on evolutionary testing. Experimental results demonstrate that EvolMathEval can not only generate a large volume of high-difficulty problems through continuous self-iteration, but can also significantly enhance the complexity of public datasets like GSM8K through evolution, reducing model accuracy by an average of 48%. Deeper investigation reveals that when solving these evolved problems, LLMs tend to bypass complex multi-step logical reasoning by relying on simplistic and fuzzy conditions, which leads to incorrect solutions. We define this phenomenon as the "Pseudo Aha Moment", which we find accounts for 77% to 100% of errors on targeted problems. Code and resources are available at: https://anonymous.4open.science/r/EvolMathEval

