CL · AI · LG · Aug 12, 2025

An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems

arXiv:2508.08833v2 · 7 citations · h-index: 2 · Has Code
Originality Incremental advance
AI Analysis

This work addresses the need for more accurate evaluation of LLMs' mathematical reasoning capabilities for researchers and developers, though it is incremental as it builds on existing benchmarking methods with a new framework.

The paper tackled the problem of assessing LLMs' robustness in mathematical reasoning by stress-testing them on advanced math problems with mathematically-equivalent variations, revealing sharp performance degradation across models, such as OpenAI's O3 dropping from 51.5% on originals to 46.8% on surface-renaming variants and 38.6% on parametric variants.

In this paper, we introduce a systematic framework that goes beyond conventional methods to assess LLMs' mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but vary in linguistic and parametric form. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models, we observe sharp performance degradation on the variants. OpenAI's flagship reasoning model, O3, scores 51.5% on the originals but drops by 4.7 percentage points on surface-renaming variants and by 12.9 percentage points on parametric variants, while smaller models fare far worse. Overall, the results show that the proposed evaluation methodology is effective for deepening our understanding of the robustness of LLMs and for generating new insights to further improve their mathematical reasoning capabilities.
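The percentage-point drops quoted in the abstract can be checked against the variant accuracies given in the analysis above with a few lines of arithmetic. The following is a minimal sketch, not code from the paper: the helper names `variant_accuracy` and `relative_drop` are hypothetical, and the only inputs are the O3 figures stated in the text.

```python
# Figures quoted in the abstract for O3 (accuracy in percent, drops in percentage points).
ORIGINAL_ACC = 51.5
DROPS_PP = {"surface-renaming": 4.7, "parametric": 12.9}


def variant_accuracy(original: float, drop_pp: float) -> float:
    """Accuracy on a variant set, given the absolute drop in percentage points."""
    return round(original - drop_pp, 1)


def relative_drop(original: float, drop_pp: float) -> float:
    """Drop expressed as a percentage of the original accuracy (hypothetical helper)."""
    return round(100 * drop_pp / original, 1)


if __name__ == "__main__":
    for name, drop in DROPS_PP.items():
        acc = variant_accuracy(ORIGINAL_ACC, drop)
        rel = relative_drop(ORIGINAL_ACC, drop)
        print(f"{name}: {acc}% accuracy, {rel}% relative drop")
```

Under these figures, a 12.9-point drop on parametric variants corresponds to losing roughly a quarter of the original accuracy, which is why absolute and relative drops are worth reporting separately.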

