CLMay 17, 2025

When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research

Guijin Son, Jiwoo Hong, Honglu Fan, Heejeong Nam, Hyunwoo Ko, Seungwon Lim, Jinyeop Song, Jinha Choi, Gonçalo Paulo, Youngjae Yu, Stella Biderman

arXiv:2505.11855v118.818 citationsh-index: 8

Originality Synthesis-oriented

AI Analysis

This work addresses the challenge of automating academic verification for researchers and publishers, but it is incremental as it benchmarks existing LLMs on a new dataset without proposing a novel method.

The paper tackles the problem of using large language models (LLMs) as automated verifiers for scientific manuscripts, introducing the SPOT benchmark with 83 papers and 91 errors, and finds that state-of-the-art LLMs achieve at best 21.1% recall and 6.1% precision, indicating poor reliability.

Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists. To date, prior work casts these systems as generative co-authors responsible for crafting hypotheses, synthesizing code, or drafting manuscripts. In this work, we explore a complementary application: using LLMs as verifiers to automate the \textbf{academic verification of scientific manuscripts}. To that end, we introduce SPOT, a dataset of 83 published papers paired with 91 errors significant enough to prompt errata or retraction, cross-validated with actual authors and human annotators. Evaluating state-of-the-art LLMs on SPOT, we find that none surpasses 21.1\% recall or 6.1\% precision (o3 achieves the best scores, with all others near zero). Furthermore, confidence estimates are uniformly low, and across eight independent runs, models rarely rediscover the same errors, undermining their reliability. Finally, qualitative analysis with domain experts reveals that even the strongest models make mistakes resembling student-level misconceptions derived from misunderstandings. These findings highlight the substantial gap between current LLM capabilities and the requirements for dependable AI-assisted academic verification.

View on arXiv PDF

Similar