CL · Oct 10, 2025

StatEval: A Comprehensive Benchmark for Large Language Models in Statistics

arXiv:2510.09517v14 citations · h-index: 5 · Has Code
Originality: Synthesis-oriented
AI Analysis

StatEval addresses the lack of rigorous benchmarks for statistical reasoning in LLMs, a gap that matters to researchers and developers working to improve AI reasoning in this domain. The contribution is incremental in that it focuses on benchmarking rather than novel model development.

The authors introduced StatEval, a comprehensive benchmark for large language models in statistics, covering foundational and research-level tasks, and found that current models, including GPT5-mini, achieve below 57% on research-level problems, highlighting limitations in statistical reasoning.

Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce StatEval, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2,374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that closed-source models such as GPT5-mini achieve below 57% on research-level problems, while open-source models perform significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.

