AI · Oct 31, 2025
CombiGraph-Vis: A Curated Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning
Hamed Mahdavi, Pouria Mahdavinia, Alireza Farhadi et al.
AI · Apr 1, 2025
Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Hamed Mahdavi, Alireza Hashemi, Majid Daliri et al.
Recent advances in large language models (LLMs) have shown impressive progress in mathematical reasoning tasks. However, current evaluation benchmarks predominantly focus on the accuracy of final answers, often overlooking the crucial logical rigor for mathematical problem solving. The claim that state-of-the-art LLMs can solve Math Olympiad-level problems requires closer examination. To explore this, we conducted both qualitative and quantitative human evaluations of proofs generated by LLMs, and developed a schema for automatically assessing their reasoning capabilities. Our study reveals that current LLMs fall significantly short of solving challenging Olympiad-level problems and frequently fail to distinguish correct mathematical reasoning from clearly flawed solutions. Our analyses demonstrate that the occasional correct final answers provided by LLMs often result from pattern recognition or heuristic shortcuts rather than genuine mathematical reasoning. These findings underscore the substantial gap between LLM performance and human expertise in advanced mathematical reasoning and highlight the importance of developing benchmarks that prioritize the soundness of the reasoning used to arrive at an answer rather than the mere correctness of the final answers.
QUANT-PH · Apr 29
MLMC-qDRIFT: Multilevel Variance Reduction for Randomized Quantum Hamiltonian Simulation
Pegah Mohammadipour, Xiantao Li
Simulating quantum dynamics is one of the central applications of quantum computing. For Hamiltonians written as a sum of many terms, deterministic Trotter--Suzuki product formulas can require applying a large number of term-wise evolutions at each time step, leading to high circuit costs for large or dense systems. Randomized methods such as qDRIFT offer an alternative: each step samples only one Hamiltonian term, giving a circuit depth with no explicit dependence on the number of terms. However, when qDRIFT is used for observable estimation, high precision requires many independent random circuit realizations, resulting in a total gate complexity that scales as $\mathcal{O}(\varepsilon^{-3})$. We introduce a multilevel Monte Carlo framework for qDRIFT that reduces this sampling overhead. The method constructs a hierarchy of qDRIFT estimators with increasing circuit depths and couples adjacent levels by sharing their random Hamiltonian-term samples. This coupling makes the variance of the level differences decay with depth, allowing most samples to be taken on cheaper, coarse circuits and only a few on expensive, fine circuits. We prove that the resulting MLMC-qDRIFT estimator reduces the total gate complexity for fixed-precision observable estimation from the standard qDRIFT scaling $\mathcal{O}(\varepsilon^{-3})$ to $\mathcal{O}(\varepsilon^{-2}\log^2(1/\varepsilon))$, while preserving qDRIFT's lack of explicit dependence on the number of Hamiltonian terms. Numerical experiments for spin-chain dynamics confirm the predicted variance decay and demonstrate the practical gate-count savings of the multilevel construction.
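To make the baseline concrete, the following is a minimal classical sketch of plain qDRIFT observable estimation on a toy single-qubit Hamiltonian. It is not the authors' code and does not implement the multilevel coupling, whose details the abstract does not give. The function names (`qdrift_unitary`, `estimate_observable`), the Hamiltonian $H = 0.6X + 0.8Z$, and all parameter values are illustrative assumptions. Each step samples one term with probability proportional to $|h_j|$ and applies a rotation by $\lambda t / N$, where $\lambda = \sum_j |h_j|$.

```python
import numpy as np

# Pauli matrices for a toy Hamiltonian H = sum_j h_j P_j (assumed example).
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)


def qdrift_unitary(coeffs, paulis, t, n_steps, rng):
    """Build one random qDRIFT circuit realization approximating exp(-i H t)."""
    coeffs = np.asarray(coeffs, dtype=float)
    lam = np.abs(coeffs).sum()          # lambda = sum_j |h_j|
    probs = np.abs(coeffs) / lam        # sampling distribution over terms
    theta = lam * t / n_steps           # rotation angle per sampled term
    indices = rng.choice(len(coeffs), size=n_steps, p=probs)
    U = I2.copy()
    for j in indices:
        sign = np.sign(coeffs[j])
        # For a Pauli P (P @ P = I): exp(-i s theta P) = cos(theta) I - i s sin(theta) P.
        step = np.cos(theta) * I2 - 1j * sign * np.sin(theta) * paulis[j]
        U = step @ U
    return U


def estimate_observable(coeffs, paulis, t, n_steps, n_samples, obs, psi, rng):
    """Average <psi| U^dagger O U |psi> over independent qDRIFT realizations."""
    values = []
    for _ in range(n_samples):
        phi = qdrift_unitary(coeffs, paulis, t, n_steps, rng) @ psi
        values.append(np.real(np.vdot(phi, obs @ phi)))
    return float(np.mean(values))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    psi0 = np.array([1, 0], dtype=complex)
    est = estimate_observable([0.6, 0.8], [X, Z], 0.5, 50, 500, Z, psi0, rng)
    # Closed form for this H (unit-norm coefficient vector): cos^2 t + 0.28 sin^2 t.
    exact = np.cos(0.5) ** 2 + 0.28 * np.sin(0.5) ** 2
    print(f"qDRIFT estimate: {est:.4f}, exact: {exact:.4f}")
```

In this sketch the circuit depth is set by `n_steps` alone, while statistical precision is bought with `n_samples`; the multilevel scheme described above targets that sampling cost by placing most samples on shallow circuits.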
AI · Oct 10, 2025
RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows
Hamed Mahdavi, Pouria Mahdavinia, Samira Malek et al.
State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.