CL · Mar 27, 2025

Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models

arXiv:2503.21380v2 · 49 citations · h-index: 25 · Has Code
Originality: Synthesis-oriented
AI Analysis

OlymMATH gives researchers and developers working on large reasoning models a more challenging and rigorous evaluation framework, though the contribution is incremental because it builds on existing benchmark efforts.

The authors address the saturation of existing benchmarks for evaluating mathematical reasoning in large language models by introducing OlymMATH, an Olympiad-level benchmark of 200 problems available in parallel English and Chinese versions. State-of-the-art models achieve notably limited accuracy on its hard subset.

In recent years, the rapid development of large reasoning models has resulted in the saturation of existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark, designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two distinct difficulty tiers: (1) AIME-level problems (easy) that establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard) designed to push the boundaries of current state-of-the-art models. In our benchmark, these problems span four core mathematical fields, and each problem includes a verifiable numerical solution to enable objective, rule-based evaluation. Empirical results underscore the significant challenge presented by OlymMATH, with state-of-the-art models including DeepSeek-R1, OpenAI's o3-mini and Gemini 2.5 Pro Exp demonstrating notably limited accuracy on the hard subset. Furthermore, the benchmark facilitates comprehensive bilingual assessment of mathematical reasoning abilities, a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the benchmark, evaluation code, detailed results and a data visualization tool at https://github.com/RUCAIBox/OlymMATH.
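
To illustrate what "objective, rule-based evaluation" against a verifiable numerical solution can look like, here is a minimal Python sketch. It is not the authors' evaluation code (that is in the repository linked above). The function names (`extract_boxed`, `to_number`, `is_correct`, `accuracy`) and the convention that the model writes its final answer inside `\boxed{...}` are assumptions made for this example, and only integers, decimals, `a/b`, and `\frac{a}{b}` are treated as numbers.

```python
import re
from fractions import Fraction


def extract_boxed(text):
    # Return the contents of the last \boxed{...} in text, or None.
    # Assumption: the model marks its final answer with \boxed{...}.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def to_number(s):
    # Parse an integer, decimal, a/b, or \frac{a}{b} into an exact Fraction.
    s = s.strip()
    m = re.fullmatch(r"(-?)\\[dt]?frac\{(\d+)\}\{(\d+)\}", s)
    if m:
        s = f"{m.group(1)}{m.group(2)}/{m.group(3)}"
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def is_correct(model_output, reference):
    # Rule-based check: exact numeric match, else exact string match.
    predicted = extract_boxed(model_output)
    if predicted is None:
        return False
    p, r = to_number(predicted), to_number(reference)
    if p is not None and r is not None:
        return p == r
    return predicted == reference.strip()


def accuracy(model_outputs, references):
    # Fraction of problems whose final answer matches the reference.
    results = [is_correct(o, r) for o, r in zip(model_outputs, references)]
    return sum(results) / len(results) if results else 0.0
```

Comparing exact fractions rather than floats avoids tolerance choices for rational answers. Answers involving radicals or other symbolic forms would need a symbolic comparison step, which this sketch replaces with a plain string match.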

Code Implementations: 2 repos