CL · Mar 27, 2025

CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?

Jiefu Ou, William Gantt Walden, Kate Sanders, Zhengping Jiang, Kaiser Sun, Jeffrey Cheng, William Jurayj, Miriam Wanner, Shaobo Liang, Candice Morgan, Seunghoon Han, Weiqi Wang

arXiv:2503.21717v1 · 12.05 citations · h-index: 15 · Has Code · EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the challenge of ensuring that automated peer reviews are sound and grounded in a paper's claims, which matters to both researchers and reviewers. The advance is incremental, since it builds on existing benchmarking efforts.

The authors address the problem of evaluating how well LLMs ground their critiques of scientific papers in the papers' claims. They introduce CLAIMCHECK, an annotated dataset of NeurIPS submissions and reviews, to benchmark LLMs on claim-centric tasks, and find that state-of-the-art LLMs underperform human experts on most tasks.

A core part of scientific peer review involves providing expert critiques that directly assess the scientific claims a paper makes. While it is now possible to automatically generate plausible (if generic) reviews, ensuring that these reviews are sound and grounded in the papers' claims remains challenging. To facilitate LLM benchmarking on these challenges, we introduce CLAIMCHECK, an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. CLAIMCHECK is richly annotated by ML experts for weakness statements in the reviews and the paper claims that they dispute, as well as fine-grained labels of the validity, objectivity, and type of the identified weaknesses. We benchmark several LLMs on three claim-centric tasks supported by CLAIMCHECK, requiring models to (1) associate weaknesses with the claims they dispute, (2) predict fine-grained labels for weaknesses and rewrite the weaknesses to enhance their specificity, and (3) verify a paper's claims with grounded reasoning. Our experiments reveal that cutting-edge LLMs, while capable of predicting weakness labels in (2), continue to underperform relative to human experts on all other tasks.
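As a rough illustration of the annotation structure and of task (1) described in the abstract, the sketch below models one weakness statement together with the claims it disputes, then scores a predicted association against the gold one. The class name, field names, and value types are hypothetical and are not taken from the released CLAIMCHECK schema. The set-overlap F1 is an assumed stand-in metric, not necessarily the one the authors report.

```python
from dataclasses import dataclass, field


@dataclass
class Weakness:
    """One reviewer weakness statement (hypothetical fields, not the real schema)."""
    text: str
    # Ids of the paper claims this weakness disputes (assumed integer ids).
    disputed_claim_ids: set[int] = field(default_factory=set)
    # Label names follow the abstract; the value types are assumptions.
    valid: bool = True
    objective: bool = True
    weakness_type: str = "unspecified"


def association_f1(gold: set[int], predicted: set[int]) -> float:
    """Set-overlap F1 between gold and predicted disputed-claim ids."""
    if not gold and not predicted:
        return 1.0  # nothing to associate, and nothing was predicted
    overlap = len(gold & predicted)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the gold annotation disputes claims 2 and 5,
# while a model predicts claims 2 and 7.
weakness = Weakness(
    text="The ablation does not support the claimed gain.",
    disputed_claim_ids={2, 5},
)
score = association_f1(weakness.disputed_claim_ids, {2, 7})
```

Here one of the two predicted claims matches one of the two gold claims, so precision and recall are both 0.5. Treating the association as a set comparison keeps the score independent of the order in which claims are listed.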
