AI · Jul 22, 2025

INTEGRALBENCH: Benchmarking LLMs with Definite Integral Problems

Bintao Tang, Xin Yang, Yuhao Wang, Zixuan Qiu, Zimo Ji, Wenyuan Jiang

arXiv:2507.21130v1 · 2 citations · h-index: 1

Originality: Synthesis-oriented

AI Analysis

INTEGRALBENCH provides a rigorous evaluation framework for automated mathematical reasoning, specifically definite integral computation. The contribution is incremental, however, because it covers a narrow domain within existing benchmarking efforts.

The authors address the evaluation of Large Language Models (LLMs) on definite integral problems by introducing INTEGRALBENCH, a benchmark with symbolic and numerical ground truth solutions. Across nine state-of-the-art models, they find significant performance gaps and correlations between problem difficulty and accuracy.

We present INTEGRALBENCH, a focused benchmark designed to evaluate Large Language Model (LLM) performance on definite integral problems. INTEGRALBENCH provides both symbolic and numerical ground truth solutions with manual difficulty annotations. Our evaluation of nine state-of-the-art LLMs reveals significant performance gaps and strong correlations between problem difficulty and model accuracy, establishing baseline metrics for this challenging domain. INTEGRALBENCH aims to advance automated mathematical reasoning by providing a rigorous evaluation framework specifically tailored for definite integral computation.
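As a rough illustration of how a benchmark item with both symbolic and numerical ground truth might be checked, the sketch below cross-validates a closed-form answer against numerical quadrature and then groups correctness by difficulty label. It is not the authors' evaluation code: the example integral, the tolerances, the integer difficulty labels, and the helper names (`simpson`, `numerically_correct`, `accuracy_by_difficulty`) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def simpson(f, a, b, n=1000):
    """Composite Simpson's rule on [a, b] using n subintervals (n is made even)."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Odd-indexed interior points get weight 4, even-indexed get weight 2.
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def numerically_correct(predicted, reference, rel_tol=1e-6, abs_tol=1e-9):
    """Return True if a predicted value matches the reference within tolerance.

    The tolerances are illustrative assumptions, not values from the paper.
    """
    return math.isclose(predicted, reference, rel_tol=rel_tol, abs_tol=abs_tol)


def accuracy_by_difficulty(results):
    """Map each difficulty label to the fraction of correct answers.

    `results` is an iterable of (difficulty, is_correct) pairs; the labels
    are hypothetical stand-ins for the benchmark's manual annotations.
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for difficulty, is_correct in results:
        counts[difficulty][0] += int(is_correct)
        counts[difficulty][1] += 1
    return {label: correct / total for label, (correct, total) in counts.items()}


# Hypothetical item: the integral of x * exp(-x) over [0, 1] equals 1 - 2/e.
symbolic_value = 1 - 2 / math.e
numeric_value = simpson(lambda x: x * math.exp(-x), 0.0, 1.0)
ground_truth_consistent = numerically_correct(numeric_value, symbolic_value)
```

A model's numeric answer could be passed to `numerically_correct` in the same way, and the resulting per-item outcomes fed to `accuracy_by_difficulty` to see how accuracy changes with difficulty.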
