CL, Mar 31, 2025

Benchmarking Large Language Models for Calculus Problem-Solving: A Comparative Analysis

arXiv:2504.13187v1
Originality: Synthesis-oriented
AI Analysis

This work benchmarks LLMs for educational applications in calculus, highlighting their procedural strengths and their limitations in conceptual reasoning relative to humans.

The study evaluated five large language models on calculus differentiation problems and found significant performance disparities: Chat GPT 4o achieved the highest success rate (94.71%) and Meta AI the lowest (56.75%). All models struggled with tasks that require conceptual understanding, such as optimization word problems.

This study presents a comprehensive evaluation of five leading large language models (LLMs) - Chat GPT 4o, Copilot Pro, Gemini Advanced, Claude Pro, and Meta AI - on their performance in solving calculus differentiation problems. The investigation assessed these models across 13 fundamental problem types, employing a systematic cross-evaluation framework where each model solved problems generated by all models. Results revealed significant performance disparities, with Chat GPT 4o achieving the highest success rate (94.71%), followed by Claude Pro (85.74%), Gemini Advanced (84.42%), Copilot Pro (76.30%), and Meta AI (56.75%). All models excelled at procedural differentiation tasks but showed varying limitations with conceptual understanding and algebraic manipulation. Notably, problems involving increasing/decreasing intervals and optimization word problems proved most challenging across all models. The cross-evaluation matrix revealed that Claude Pro generated the most difficult problems, suggesting distinct capabilities between problem generation and problem-solving. These findings have significant implications for educational applications, highlighting both the potential and limitations of LLMs as calculus learning tools. While they demonstrate impressive procedural capabilities, their conceptual understanding remains limited compared to human mathematical reasoning, emphasizing the continued importance of human instruction for developing deeper mathematical comprehension.
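The cross-evaluation framework described in the abstract (each model solves problems generated by every model) can be pictured as a solver-by-generator matrix of pass/fail outcomes. The sketch below is one possible way to summarize such a matrix, not the authors' scoring procedure: the model names, the outcomes, and the choice to pool all outcomes before computing a percentage are assumptions made for illustration.

```python
# Illustrative sketch only: the model names and pass/fail outcomes below are
# made-up placeholders, not data from the study.

def pooled_rate(outcomes):
    """Percentage of True values in a flat list of pass/fail outcomes."""
    return 100.0 * sum(outcomes) / len(outcomes)


def solver_scores(results):
    """Success rate of each solver, pooled over problems from every generator."""
    return {
        solver: pooled_rate(
            [ok for outcomes in per_generator.values() for ok in outcomes]
        )
        for solver, per_generator in results.items()
    }


def generator_scores(results):
    """Success rate of all solvers on each generator's problems.

    A lower value means that generator's problems were harder to solve.
    """
    pooled = {}
    for per_generator in results.values():
        for generator, outcomes in per_generator.items():
            pooled.setdefault(generator, []).extend(outcomes)
    return {gen: pooled_rate(outcomes) for gen, outcomes in pooled.items()}


# toy_results[solver][generator] -> one boolean per problem (True = correct).
toy_results = {
    "model_a": {
        "model_a": [True, True, True, True],
        "model_b": [True, True, True, False],
        "model_c": [True, True, False, False],
    },
    "model_b": {
        "model_a": [True, True, True, False],
        "model_b": [True, True, True, True],
        "model_c": [True, False, False, False],
    },
    "model_c": {
        "model_a": [True, True, False, False],
        "model_b": [True, True, True, False],
        "model_c": [True, True, False, False],
    },
}

solver_rates = solver_scores(toy_results)
generator_rates = generator_scores(toy_results)
hardest_generator = min(generator_rates, key=generator_rates.get)

print("success rate by solver:", solver_rates)
print("success rate by problem generator:", generator_rates)
print("hardest problems came from:", hardest_generator)
```

Reading the matrix along rows gives a per-solver success rate, while reading it along columns shows whose problems were hardest. This is the kind of comparison that lets problem-generation ability be separated from problem-solving ability.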
