CL, LG | Aug 20, 2024

Benchmarking Large Language Models for Math Reasoning Tasks

Kathrin Seßler, Yao Rong, Emek Gözlüklü, Enkelejda Kasneci

arXiv:2408.10839 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of selecting appropriate LLMs for specific math reasoning tasks, providing a benchmark for researchers and practitioners, though it is incremental as it builds on existing datasets and algorithms.

The authors tackled the lack of comprehensive benchmarking for large language models (LLMs) in math reasoning by comparing seven in-context learning algorithms across five datasets on four foundation models, finding that larger models like GPT-4o and LLaMA 3-70B perform well regardless of prompting, while smaller models are more sensitive to the approach.

The use of Large Language Models (LLMs) in mathematical reasoning has become a cornerstone of related research, demonstrating the intelligence of these models and enabling potential practical applications through their advanced performance, such as in educational settings. Despite the variety of datasets and in-context learning algorithms designed to improve the ability of LLMs to automate mathematical problem solving, the lack of comprehensive benchmarking across different datasets makes it difficult to select an appropriate model for specific tasks. In this project, we present a benchmark that fairly compares seven state-of-the-art in-context learning algorithms for mathematical problem solving across five widely used mathematical datasets on four powerful foundation models. Furthermore, we explore the trade-off between efficiency and performance, highlighting the practical applications of LLMs for mathematical reasoning. Our results indicate that larger foundation models like GPT-4o and LLaMA 3-70B can solve mathematical reasoning tasks independently of the concrete prompting strategy, while for smaller models the in-context learning approach significantly influences performance. Moreover, the optimal prompt depends on the chosen foundation model. We open-source our benchmark code to support the integration of additional models in future research.
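The benchmark design described above, every in-context learning algorithm evaluated on every dataset with every foundation model, can be pictured as a simple evaluation grid. The following is only a minimal sketch under stated assumptions, not the authors' released code: the names `accuracy`, `run_grid`, and `toy_model`, the exact-match scoring, and the toy addition data are all hypothetical stand-ins chosen for illustration.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Problem = Tuple[str, str]  # (question, reference answer)


def accuracy(solve: Callable[[str], str], problems: List[Problem]) -> float:
    """Fraction of problems whose predicted answer exactly matches the reference."""
    if not problems:
        return 0.0
    correct = sum(solve(q).strip() == a.strip() for q, a in problems)
    return correct / len(problems)


def run_grid(
    models: Dict[str, Callable[[str], str]],
    strategies: Dict[str, Callable[[str], str]],
    datasets: Dict[str, List[Problem]],
) -> Dict[Tuple[str, str, str], float]:
    """Score every (model, prompting strategy, dataset) combination."""
    results = {}
    for (m_name, model), (s_name, strategy), (d_name, problems) in product(
        models.items(), strategies.items(), datasets.items()
    ):
        # A strategy turns a question into a prompt; the model answers the prompt.
        def solve(q: str, model=model, strategy=strategy) -> str:
            return model(strategy(q))

        results[(m_name, s_name, d_name)] = accuracy(solve, problems)
    return results


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: adds the two integers after 'Q:'."""
    expr = prompt.split("Q:")[-1].strip()
    a, b = expr.split("+")
    return str(int(a) + int(b))


# Hypothetical prompting strategies and data, for illustration only.
strategies = {
    "zero_shot": lambda q: f"Q: {q}",
    "step_by_step": lambda q: f"Think step by step.\nQ: {q}",
}
datasets = {"toy_addition": [("1+2", "3"), ("10+5", "15")]}
models = {"toy": toy_model}

results = run_grid(models, strategies, datasets)
for key, acc in sorted(results.items()):
    print(key, acc)
```

With seven strategies, five datasets, and four models, the same loop would produce 7 × 5 × 4 = 140 cells. Recording token counts or latency alongside accuracy in each cell would be one way to examine the efficiency and performance trade-off the abstract mentions.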
