AI · Jun 2

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

Noujoud Nader, Ibrahem Aljabea, Patrick Diehl, Deepti Gupta

arXiv:2606.03144 · 77.2

Predicted impact: top 40% in AI · last 90 days
Originality: Incremental advance

AI Analysis

For educators and researchers deploying LLMs in mathematical education and scientific research, this benchmark reveals severe performance disparities and systematic human-AI disagreement in evaluating complex reasoning.

GTBench introduces a curriculum-grounded benchmark of 63 graph theory problems across three difficulty levels to evaluate LLMs as mathematical research assistants. GPT-5 achieves 95.8% on undergraduate definitions and 82% on graduate proofs, while other models degrade significantly, with Llama 3.3 70B scoring 0% on graduate proofs under human evaluation.

Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood. We introduce GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems organized into three groups of increasing difficulty: undergraduate definitions and basic properties (Group 1), algorithm tracing and structural reasoning (Group 2), and graduate-level proof construction (Group 3). Problems are sourced from verified academic materials including Diestel's Graph Theory. We evaluate five frontier models -- GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3 -- under zero-shot and chain-of-thought prompting, using exact-match and LLM-as-judge evaluation for Groups 1 and 2, and a hybrid human expert and LLM-as-judge protocol for Group 3. Our results reveal a pronounced performance hierarchy: GPT-5 approaches ceiling on Group 1 (95.8% zero-shot) and maintains meaningful accuracy on graduate proofs (82%), while all other models degrade substantially with difficulty, with Llama 3.3 70B achieving 0% under human evaluation on Group 3 zero-shot. Failure mode analysis shows that "correct algorithm, wrong execution" errors dominate Groups 1 and 2, while Group 3 additionally surfaces incomplete-reasoning failures and reveals systematic disagreement between human evaluators and the automated judge, particularly on verbose or near-complete proofs (kappa = 0.48-0.83 across human pairs). GTBench provides the first curriculum-grounded evaluation framework for graph-theoretic reasoning in LLMs, with direct implications for the governance of AI tools in mathematical education and scientific research.
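The agreement range quoted above (kappa = 0.48-0.83 across human pairs) is easier to interpret with the statistic written out. The abstract does not say which kappa variant was used; the sketch below assumes pairwise Cohen's kappa, which is a common choice when two raters are compared at a time. The function name `cohen_kappa` and the verdict lists are hypothetical illustrations, not the authors' code or data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (assumed variant)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items on which both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical verdicts (1 = proof accepted, 0 = rejected); NOT data from the paper.
human_1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
human_2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
kappa = cohen_kappa(human_1, human_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the two raters agree on 8 of 10 proofs, yet kappa is noticeably lower than 0.8 because part of that agreement is expected by chance. This is why a value such as 0.48 signals only moderate agreement even when raw agreement looks high.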
