AI · CY · May 10

Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

arXiv:2605.09292
AI Analysis

For researchers evaluating LLM reasoning, this work presents strategy diversity as a metric complementary to accuracy, revealing a decoupling between answer accuracy and strategy diversity that accuracy-only benchmarks miss.

The paper introduces a strategy-level evaluation framework for LLM mathematical reasoning. It finds that although models achieve high answer accuracy (95-100%) under a single-solution prompt, they produce substantially fewer distinct valid strategies under a multiple-strategy prompt than the human reference set contains (e.g., 184 for Gemini vs. 217 reference strategies). In a repeated-run check on a 20-problem subset, the best model recovers only 39 of 55 reference strategies (71%) after three runs.

Large language models now achieve high final-answer accuracy on mathematical reasoning benchmarks, but accuracy alone does not capture reasoning flexibility. We introduce a strategy-level evaluation framework instantiated on 80 AMC 10/12 and AIME problems with 217 AoPS-derived reference strategy families. Model outputs are annotated for strategy identity, validity, and correctness using dual-AI coding with human adjudication. Across four frontier models, we find a pronounced decoupling between answer accuracy and strategy diversity. Under a single-solution prompt, all models achieve high accuracy (95%-100%), but under a multiple-strategy prompt they recover substantially fewer strategies than the human reference set. Gemini, DeepSeek, GPT, and Claude generate 184, 152, 151, and 110 distinct valid strategies, respectively, with the largest gaps in Geometry and Number Theory. The models collectively produce 50 benchmark-novel valid strategies, indicating both incomplete coverage of human strategies and some capacity for alternative reasoning. A repeated-run robustness check on 20 problems shows diminishing gains in discovered strategies, with the strongest model recovering only 39 of 55 AoPS-reference strategies (71%) after three runs. These findings position strategy diversity as a complementary dimension for evaluating mathematical reasoning beyond answer correctness.
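The percentages implied by the abstract's counts can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable and function names are hypothetical, and the per-model ratio is simply the count of distinct valid strategies divided by the size of the reference set. It is not a coverage figure, because the abstract does not say how many of each model's strategies are benchmark-novel rather than matches to reference strategies.

```python
# Figures quoted in the abstract; all names here are illustrative, not from the paper.
REFERENCE_STRATEGIES = 217  # AoPS-derived reference strategy families over 80 problems

# Distinct valid strategies per model under the multiple-strategy prompt.
distinct_valid = {"Gemini": 184, "DeepSeek": 152, "GPT": 151, "Claude": 110}


def ratio_pct(count: int, total: int) -> float:
    """Return count / total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Size of each model's strategy set relative to the reference set.
# Assumption: this is a raw count ratio, NOT reference coverage, since model
# counts may include benchmark-novel strategies.
count_vs_reference = {
    model: ratio_pct(n, REFERENCE_STRATEGIES) for model, n in distinct_valid.items()
}

# Repeated-run robustness check on 20 problems: 39 of 55 reference strategies recovered.
robustness_recovery = ratio_pct(39, 55)

if __name__ == "__main__":
    for model, pct in count_vs_reference.items():
        print(f"{model}: {pct}% of reference-set size")
    print(f"Best model, three runs: {robustness_recovery}% of reference strategies")
```

Rounded to the nearest whole number, the last figure matches the 71% reported in the abstract.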

