ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding
ELAIPBench addresses a gap in assessing how well LLMs understand academic papers, which is relevant to researchers and developers, though the contribution is incremental because it builds on existing benchmarking efforts.
The authors tackled the problem of evaluating large language models' (LLMs) deep comprehension of full-length AI research papers by introducing ELAIPBench, a benchmark with 403 multiple-choice questions from 137 papers, and found that the best-performing LLM achieved only 39.95% accuracy, far below human performance.
While large language models (LLMs) excel at many domain-specific tasks, their ability to deeply comprehend and reason about full-length academic papers remains underexplored. Existing benchmarks often fall short of capturing such depth, either due to surface-level question design or unreliable evaluation metrics. To address this gap, we introduce ELAIPBench, a benchmark curated by domain experts to evaluate LLMs' comprehension of artificial intelligence (AI) research papers. Developed through an incentive-driven, adversarial annotation process, ELAIPBench features 403 multiple-choice questions from 137 papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Our experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance. Moreover, we observe that frontier LLMs equipped with a thinking mode or a retrieval-augmented generation (RAG) system fail to improve final results, even harming accuracy due to overthinking or noisy retrieval. These findings underscore the significant gap between current LLM capabilities and genuine comprehension of academic papers.
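To make the headline number concrete, the following is a minimal sketch of how accuracy on a multiple-choice benchmark of this size could be scored, and how many correct answers the reported 39.95% would imply out of 403 questions. The function names `accuracy` and `implied_correct` are hypothetical, and the exact-match scoring rule is an assumption; the abstract does not state how the authors score answers (for example, whether questions can have more than one correct option).

```python
def accuracy(predictions, answers):
    """Fraction of questions whose predicted option equals the gold option.

    Assumes exact-match scoring, which the abstract does not confirm.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("no questions to score")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def implied_correct(accuracy_pct, total):
    """Whole number of correct answers closest to a reported percentage."""
    return round(accuracy_pct / 100 * total)


TOTAL_QUESTIONS = 403      # from the abstract
BEST_ACCURACY_PCT = 39.95  # from the abstract

n = implied_correct(BEST_ACCURACY_PCT, TOTAL_QUESTIONS)
print(n, round(100 * n / TOTAL_QUESTIONS, 2))  # prints: 161 39.95
```

Under these assumptions, the best reported model would have answered roughly 161 of the 403 questions correctly.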