CL · May 28, 2025

Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation

arXiv:2505.23824v2 · 11 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses the peer review crisis by proposing an automated approach to detecting errors in scientific papers. The advance is incremental because it builds on existing LLM capabilities.

The paper tackles the problem of identifying critical errors in scientific papers by using reasoning LLMs as manuscript quality checkers instead of full peer reviewers, validating methods on withdrawn arXiv papers and finding that o3 performed best at modest cost.

Recent advancements in large language models have sparked interest in utilizing them to aid the peer review process of scientific publication amid the peer review crisis. However, having AI models generate full reviews in the same way as human reviewers risks exacerbating the irresponsible use of LLM-generated reviews. As an alternative, we propose adopting LLMs as manuscript quality checkers. We introduce several baseline approaches and an extendable automatic evaluation framework using top reasoning LLMs as judges to tackle the difficulty of recruiting domain experts for manual evaluation. Utilizing papers withdrawn from arXiv, we validated our proposed methods with several leading reasoning LLMs from multiple vendors and assessed their performance and API costs for identifying critical errors and unsoundness problems in scientific papers. o3 exhibited the best problem identification performance among all models at a modest cost. This paper provides insights into document-based scientific understanding/reasoning and lays a foundation for future applications. Our dataset, code, and model outputs are publicly available.
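The evaluation described in the abstract (LLM judges deciding whether a checker found a paper's known problem, with API cost tracked alongside) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper's code: the `PaperResult` record, the strict-majority rule for combining judge votes, and the `hit_rate` and `mean_cost_usd` summary fields are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PaperResult:
    """Hypothetical record for one withdrawn paper (not the paper's schema)."""
    paper_id: str
    judge_votes: List[bool]  # one vote per LLM judge: was the known problem found?
    api_cost_usd: float      # API cost of running the checker on this paper


def is_hit(votes: List[bool]) -> bool:
    # Assumed aggregation rule: a strict majority of judges must agree.
    return sum(votes) * 2 > len(votes)


def summarize(results: List[PaperResult]) -> Dict[str, float]:
    """Return the fraction of papers counted as hits and the mean cost per paper."""
    if not results:
        raise ValueError("results must not be empty")
    hits = sum(is_hit(r.judge_votes) for r in results)
    total_cost = sum(r.api_cost_usd for r in results)
    return {
        "hit_rate": hits / len(results),
        "mean_cost_usd": total_cost / len(results),
    }


if __name__ == "__main__":
    # Illustrative values only; these are not results from the paper.
    demo = [
        PaperResult("paper-a", [True, True, False], 0.25),
        PaperResult("paper-b", [False, False, True], 0.75),
    ]
    print(summarize(demo))
```

Reporting cost next to the hit rate mirrors the abstract's framing, where models are compared on both problem identification performance and API cost.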
