LG | Jan 28, 2025

Token-by-Token Regeneration and Domain Biases: A Benchmark of LLMs on Advanced Mathematical Problem-Solving

arXiv:2501.17084v1 | 3 citations | h-index: 3 | Has Code

Originality: Incremental advance

AI Analysis

The paper benchmarks LLMs on advanced mathematical problem-solving and highlights performance disparities and efficiency trade-offs. It is an incremental contribution, relevant to researchers and developers in AI and education.

This study evaluated 10 LLMs on 945 competition-level math problems and found a 34.5 percentage-point accuracy gap between the best commercial model (83.7%) and the worst open-source model (49.2%). For one model, token-by-token regeneration improved accuracy slightly (+0.8%) while also reducing code execution time by 36.7%.
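The gap quoted above is a difference between two accuracies, so it is worth separating the absolute difference (in percentage points) from the relative difference. The short sketch below only redoes that arithmetic from the two reported accuracies; the variable names are illustrative and not taken from the paper.

```python
# Accuracies as reported in the summary (percent of problems solved).
best_acc = 83.7   # best commercial model
worst_acc = 49.2  # worst open-source model

# Absolute gap, in percentage points.
gap_points = round(best_acc - worst_acc, 1)

# Relative gap: how much higher the best accuracy is, as a share of the worst.
relative_gap = round(100 * (best_acc - worst_acc) / worst_acc, 1)

print(f"absolute gap: {gap_points} percentage points")
print(f"relative gap: {relative_gap}% of the weaker model's accuracy")
```

Read this way, the 34.5 figure is the absolute gap; relative to the weaker model's accuracy the difference is considerably larger.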

Large language models (LLMs) excel in many natural language tasks, yet they struggle with complex mathematical problem-solving, particularly in symbolic reasoning and maintaining consistent output. This study evaluates 10 LLMs with 7 to 8 billion parameters using 945 competition-level problems from the MATH dataset. The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions. The research introduces an evaluation framework that uses mistral-large-2411 to rate answers on a 5-point scale, which helps address inconsistencies in mathematical notation. It also examines the impact of regenerating output token-by-token on refining results. The findings reveal a significant 34.5 percentage-point performance gap between the top commercial model (gpt-4o-mini, scoring 83.7%) and the least effective open-source model (open-codestral-mamba:v0.1, scoring 49.2%). This disparity is especially noticeable in complex areas like Number Theory. While token-by-token regeneration slightly improved accuracy (+0.8%) for the model llama3.1:8b, it also reduced code execution time by 36.7%, highlighting a trade-off between efficiency and precision. The study also noted a consistent trend in which harder problems correlated with lower accuracy across all models. Despite the use of controlled execution environments, less than 1% of the generated code was unsafe, and 3.17% of problems remained unsolved after 10 attempts, suggesting that hybrid reasoning methods may be beneficial.
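The abstract describes models emitting Python code that is executed, with up to 10 attempts per problem. A minimal sketch of such a retry loop is shown below. It is not the paper's harness: `generate_code` is a hypothetical callable standing in for a model call, `fake_model` is a stand-in used only for the demonstration, and a subprocess with a timeout is an assumption that falls well short of a real sandbox.

```python
import subprocess
import sys
from typing import Callable, Optional, Tuple


def run_generated_code(code: str, timeout_s: float = 5.0) -> Optional[str]:
    """Run code in a separate interpreter; return stripped stdout, or None on failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None  # treat a hang as a failed attempt
    if proc.returncode != 0:
        return None  # the generated code raised or exited with an error
    out = proc.stdout.strip()
    return out or None  # empty output counts as no answer


def solve_with_retries(
    generate_code: Callable[[int], str],
    max_attempts: int = 10,
    timeout_s: float = 5.0,
) -> Tuple[Optional[str], int]:
    """Ask for code up to max_attempts times; return (answer, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        answer = run_generated_code(generate_code(attempt), timeout_s)
        if answer is not None:
            return answer, attempt
    return None, max_attempts  # unsolved after all attempts


def fake_model(attempt: int) -> str:
    """Hypothetical stand-in for an LLM: first draft crashes, second one works."""
    if attempt == 1:
        return "raise ValueError('bad first draft')"
    return "print(sum(range(1, 11)))"


answer, attempts = solve_with_retries(fake_model)
print(answer, attempts)
```

A problem counts as unsolved in this sketch when every attempt crashes, times out, or prints nothing. Grading the printed answer (for example with a judge model on a 5-point scale, as the abstract describes) would be a separate step.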
