CL Feb 20, 2025

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

arXiv:2502.14739v4 | 45 citations | h-index: 23
Originality: Incremental advance
AI Analysis

The paper addresses the need for broader evaluation of LLMs in specialized domains beyond mainstream fields, though the advance is incremental because it extends existing benchmarking efforts.

The authors tackled the problem of evaluating large language models (LLMs) across specialized graduate disciplines by creating SuperGPQA, a benchmark covering 285 fields. They found that even the best-performing model, DeepSeek-R1, reaches only 61.82% accuracy, indicating a considerable gap between current model capabilities and artificial general intelligence.

Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.
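
As a rough illustration of the filtering idea the abstract describes, the sketch below drops questions that every model in a small panel answers correctly (treated here as trivial) and questions that an expert has flagged as ambiguous, then computes per-model accuracy on what remains. The `Question` class, the `is_trivial`, `filter_questions`, and `accuracy` functions, the toy data, and the all-models-correct criterion are assumptions made for illustration; the abstract does not specify the authors' actual pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    qid: str
    answer: str                                  # gold answer label
    model_answers: dict = field(default_factory=dict)  # model name -> predicted label
    expert_flagged_ambiguous: bool = False       # hypothetical expert-review flag


def is_trivial(q: Question) -> bool:
    """Assumed criterion: a question is trivial if every panel model gets it right."""
    return bool(q.model_answers) and all(
        pred == q.answer for pred in q.model_answers.values()
    )


def filter_questions(questions):
    """Keep questions that are neither trivial nor flagged as ambiguous."""
    return [
        q for q in questions
        if not is_trivial(q) and not q.expert_flagged_ambiguous
    ]


def accuracy(questions, model: str) -> float:
    """Fraction of questions the named model answers correctly."""
    if not questions:
        return 0.0
    correct = sum(q.model_answers.get(model) == q.answer for q in questions)
    return correct / len(questions)


# Toy pool: q1 is trivial, q2 is discriminative, q3 is flagged by an expert.
pool = [
    Question("q1", "A", {"m1": "A", "m2": "A"}),
    Question("q2", "B", {"m1": "B", "m2": "C"}),
    Question("q3", "C", {"m1": "D", "m2": "D"}, expert_flagged_ambiguous=True),
]
kept = filter_questions(pool)
```

In an iterative setting, a step like this would presumably be repeated after each round of question revision, with fresh model responses and expert feedback feeding the next pass.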

