Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
This work addresses the need for stronger benchmarks of reasoning in AI models, particularly for STEM education and research. The contribution is incremental, since it builds on existing benchmarking efforts.
The authors evaluate the reasoning capabilities of large language models by introducing the Classroom Final Exam (CFE) benchmark, which is built from authentic university STEM problems. Even the best-performing frontier model, Gemini-3.1-pro-preview, achieves only 59.69% accuracy, indicating significant room for improvement.
We introduce \CFE{} (\textbf{C}lassroom \textbf{F}inal \textbf{E}xam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. \CFE{} is curated from repeatedly used, authentic university homework and exam problems, together with reference solutions provided by course instructors. \CFE{} presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69\%, while the second-best model, Gemini-3-flash-preview, reaches 55.46\%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically have more reasoning steps than those provided by the instructor, indicating suboptimal step efficiency and a higher risk of error accumulation. The data and code are available at https://github.com/Analogy-AI/CFE_Bench.
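The link between step count and error accumulation can be illustrated with a toy calculation. This is a minimal sketch and not the paper's analysis: it assumes every reasoning step is independently correct with the same probability, and the function names and the numbers in the usage lines are hypothetical.

```python
def step_ratio(model_steps: int, reference_steps: int) -> float:
    """Ratio of model-generated steps to instructor reference steps.

    A value above 1.0 means the model used more steps than the reference.
    """
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    return model_steps / reference_steps


def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps are correct, ASSUMING independent steps
    with identical per-step accuracy (a simplification for illustration)."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must lie in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


# Hypothetical numbers: an 8-step reference solution vs. a 12-step model solution.
ratio = step_ratio(model_steps=12, reference_steps=8)
p_reference_length = chain_success_probability(0.95, 8)
p_model_length = chain_success_probability(0.95, 12)
print(f"step ratio: {ratio:.2f}")
print(f"all-steps-correct probability at 8 steps:  {p_reference_length:.3f}")
print(f"all-steps-correct probability at 12 steps: {p_model_length:.3f}")
```

Under this simplified model, a longer chain at the same per-step accuracy is strictly less likely to be correct end to end, which is one way to read the observation that extra steps raise the risk of accumulated error.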