AI | CL | May 29, 2025

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

arXiv:2505.23281v2257 citations | h-index: 64
Originality: Incremental advance
AI Analysis

MathArena addresses a need shared by researchers and developers: rigorous, contamination-free evaluation of mathematical reasoning in LLMs. Its contribution to benchmarking methodology is nonetheless incremental.

The authors address the problem of separating genuine mathematical reasoning from memorization by introducing MathArena, a benchmark that evaluates LLMs on math competition problems as soon as they are released. They find strong signs of contamination in AIME 2024 and report that top models score slightly less than 40% on IMO 2025.

The rapid advancement of reasoning capabilities in large language models (LLMs) has led to notable improvements on mathematical benchmarks. However, many of the most commonly used evaluation datasets (e.g., AIME 2024) are widely available online, making it difficult to disentangle genuine reasoning from potential memorization. Furthermore, these benchmarks do not evaluate proof-writing capabilities, which are crucial for many mathematical tasks. To address this, we introduce MathArena, a new benchmark based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems that can be used for real-time evaluation of LLMs. By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination. Using this framework, we find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as CMIMC 2025, demonstrate impressive reasoning capabilities in top-performing models. MathArena is also the first benchmark for proof-writing capabilities. On IMO 2025, top models achieve slightly less than 40%, demonstrating both notable progress and significant room for improvement. So far, we have evaluated over 50 models across seven competitions, totaling 162 problems. As an evolving benchmark, MathArena will continue to track the progress of LLMs on newly released competitions, ensuring rigorous and up-to-date evaluation of mathematical reasoning.
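The date-based idea in the abstract (a result is only trustworthy if the model could not have seen the problems) can be illustrated with a minimal sketch. This is not the MathArena code: the `Result` record, the helper names, and every model name, competition name, date, and score below are hypothetical placeholders chosen only to show the comparison.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    model: str
    model_release: date    # assumed: date the model was released
    competition: str
    problems_public: date  # assumed: date the problems became public
    solved: int
    total: int


def is_uncontaminated(r: Result) -> bool:
    """True if the model was released before the problems were public."""
    return r.model_release < r.problems_public


def split_by_contamination(results):
    """Partition results into (clean, at_risk) lists."""
    clean = [r for r in results if is_uncontaminated(r)]
    at_risk = [r for r in results if not is_uncontaminated(r)]
    return clean, at_risk


def accuracy(results) -> float:
    """Pooled accuracy: problems solved divided by problems attempted."""
    total = sum(r.total for r in results)
    if total == 0:
        return float("nan")
    return sum(r.solved for r in results) / total


# Placeholder data, not real measurements.
EXAMPLE_RESULTS = [
    Result("model-x", date(2025, 1, 10), "old-competition",
           date(2024, 2, 1), solved=12, total=15),
    Result("model-x", date(2025, 1, 10), "new-competition",
           date(2025, 2, 6), solved=6, total=15),
]

if __name__ == "__main__":
    clean, at_risk = split_by_contamination(EXAMPLE_RESULTS)
    print("clean accuracy:  ", accuracy(clean))
    print("at-risk accuracy:", accuracy(at_risk))
```

A large gap between the two pooled accuracies would be one possible sign of contamination, although problem difficulty also differs between competitions, so the gap alone is not conclusive.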

Code Implementations: 1 repo
