CL · Apr 24, 2025

Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics

arXiv:2504.17665v2 · 2 citations · h-index: 7 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for more holistic evaluations of code-assisted LLMs in mathematics, revealing critical shortcomings in reasoning beyond execution accuracy.

The study analyzed code-assisted large language models (LLMs) on mathematical reasoning tasks, finding that closed-source models use sound mathematical concepts while open-source models often rely on unsound reasoning, with soundness decreasing as problem difficulty increases.

Assisting LLMs with code generation has improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness and lacks a rigorous evaluation of the generated programs. In this work, we bridge this gap by conducting an in-depth analysis of the programs that code-assisted LLMs generate in response to math reasoning tasks, with a focus on evaluating the soundness of the underlying reasoning processes. For this purpose, we assess the generations of five LLMs on several math datasets, both manually and automatically, and propose a taxonomy of generated programs based on their logical soundness. Our findings show that the capabilities of models significantly impact the logic implemented to solve the problem. Closed-source LLMs ground their programs in mathematical concepts, whereas open-source models often resort to unsound reasoning, relying on memorized information and exhaustive searches. Furthermore, increasing the difficulty of problems decreases the share of sound generations for all models, revealing a critical shortcoming of LLMs on complex mathematics, contrary to what accuracy metrics suggest. Our work highlights the need for more holistic evaluations of code-assisted LLMs beyond execution accuracy metrics, toward a better understanding of LLMs' limits in the math domain.
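To make the gap between execution accuracy and soundness concrete, the following is a minimal illustrative sketch, not taken from the paper. It assumes a toy problem ("how many positive divisors does n have?") and three hypothetical programs: one grounded in a mathematical concept, one using exhaustive search, and one returning a memorized value. The function names and the `execution_accuracy` helper are assumptions made for illustration only.

```python
def divisors_sound(n: int) -> int:
    """Grounded in a concept: if n = p1^a1 * ... * pk^ak,
    the divisor count is (a1 + 1) * ... * (ak + 1)."""
    count = 1
    p = 2
    while p * p <= n:
        exponent = 0
        while n % p == 0:
            n //= p
            exponent += 1
        count *= exponent + 1
        p += 1
    if n > 1:
        # One prime factor remains, with exponent 1.
        count *= 2
    return count


def divisors_exhaustive(n: int) -> int:
    """Exhaustive search: test every candidate from 1 to n."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)


def divisors_memorized(n: int) -> int:
    """Memorized answer: ignores the input and returns a recalled value."""
    return 24  # hypothetical hard-coded answer for n = 360


def execution_accuracy(program, n: int, expected: int) -> bool:
    """Hypothetical metric: checks only the final output, not the logic."""
    return program(n) == expected


programs = [divisors_sound, divisors_exhaustive, divisors_memorized]
results = [execution_accuracy(f, 360, 24) for f in programs]
```

On the single input 360, all three programs pass the execution check, so an accuracy-only metric cannot tell them apart. Changing the input (for example, to the prime 97) exposes the memorized program, which is the kind of difference a soundness-oriented evaluation aims to surface.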

