A Fragile Number Sense: Probing the Elemental Limits of Numerical Reasoning in LLMs
This work highlights a critical weakness in LLMs' numerical reasoning on tasks that require creative or novel insight. The contribution is incremental, since it builds on existing concerns about these capabilities.
The study tests the robustness of numerical reasoning in Large Language Models (LLMs) on problems of escalating complexity. The models performed well on deterministic algorithmic tasks but consistently failed at a combinatorial number puzzle, which points to limitations in generative problem-solving.
Large Language Models (LLMs) have demonstrated remarkable emergent capabilities, yet the robustness of their numerical reasoning remains an open question. Standard benchmarks evaluate LLM reasoning on complex problem sets using aggregated metrics, which often obscure foundational weaknesses. In this work, we probe LLM mathematical numeracy by evaluating performance on problems of escalating complexity, from constituent operations to combinatorial puzzles. We test several state-of-the-art LLM-based agents on a 100-problem challenge comprising four categories: (1) basic arithmetic, (2) advanced operations, (3) primality checking, and (4) the Game of 24 number puzzle. Our results show that the agents achieved high accuracy on the first three categories, which require deterministic algorithmic execution, but consistently failed at the number puzzle, indicating that its demand for heuristic search over a large combinatorial space is a significant bottleneck. These findings reveal that the agents' proficiency is largely confined to recalling and executing known algorithms, rather than performing generative problem-solving. This suggests that their apparent numerical reasoning is closer to sophisticated pattern-matching than to flexible, analytical thought, which limits their potential for tasks that require novel or creative numerical insights.
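To make the combinatorial nature of the fourth category concrete, the following is a minimal sketch of an exhaustive Game of 24 solver. It assumes the commonly used rules (each given number is used exactly once, combined with +, -, *, / and arbitrary parenthesization); the abstract does not spell out the exact rules or any reference solver, so the function names `solve_24` and `_search` are purely illustrative. Exact rational arithmetic is used so that solutions passing through fractional intermediate values are not lost to rounding.

```python
from fractions import Fraction
from itertools import combinations


def solve_24(numbers, target=24):
    """Return one expression string that evaluates to `target`, or None.

    Assumed rules (not taken from the paper): every number is used exactly
    once, and the allowed operations are +, -, *, / with any parenthesization.
    """
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    # Base case: a single value remains; check whether it hits the target.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None

    # Pick any two remaining values, combine them, and recurse on the rest.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]

        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        # Division is only defined for a non-zero divisor.
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))

        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


if __name__ == "__main__":
    print(solve_24([4, 7, 8, 8]))   # prints one valid expression
    print(solve_24([1, 1, 1, 1]))   # no solution exists for this input
```

Even for four inputs, the search branches over every pair of remaining values and up to six ways of combining them at each step. A program can enumerate this space mechanically, whereas a model that produces a single chain of arithmetic has to choose the right branch without that exhaustive backtracking.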