CL | Feb 24, 2025

Mind the Blind Spots: A Focus-Level Evaluation Framework for LLM Reviews

arXiv:2502.17086v4 | 21 citations | h-index: 8 | EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses the need for systematic evaluation of LLM-generated reviews in scientific peer review. The advance is incremental because it builds on existing surface- and content-level methods.

The paper tackles the problem of evaluating LLM-generated peer reviews by introducing a focus-level framework that measures how attention is distributed across predefined facets. It finds that off-the-shelf LLMs are biased towards technical validity and overlook novelty assessment compared to human experts.

Peer review underpins scientific progress, but it is increasingly strained by reviewer shortages and growing workloads. Large Language Models (LLMs) can now draft reviews automatically, but determining whether LLM-generated reviews are trustworthy requires systematic evaluation. Researchers have evaluated LLM reviews at either the surface level (e.g., BLEU and ROUGE) or the content level (e.g., specificity and factual accuracy). Yet it remains uncertain whether LLM-generated reviews attend to the same critical facets that human experts weigh, namely the strengths and weaknesses that ultimately drive an accept-or-reject decision. We introduce a focus-level evaluation framework that operationalizes focus as a normalized distribution of attention across predefined facets in paper reviews. Building on this framework, we developed an automatic focus-level evaluation pipeline that uses two sets of facets: target (e.g., problem, method, and experiment) and aspect (e.g., validity, clarity, and novelty). The pipeline leverages 676 paper reviews (https://figshare.com/s/d5adf26c802527dd0f62) from OpenReview, which contain 3,657 strengths and weaknesses identified by human experts. Comparing the focus distributions of LLMs and human experts showed that off-the-shelf LLMs are consistently biased towards examining technical validity while significantly overlooking novelty assessment when criticizing papers.
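
As a rough illustration of what "a normalized distribution of attention across predefined facets" could look like in practice, the sketch below counts how many review points fall under each aspect facet and divides by the total. It is a minimal sketch under assumptions, not the authors' pipeline: the function names `focus_distribution` and `focus_gap` are hypothetical, the facet list contains only the three aspect examples named in the abstract, the labels are made up, and the per-facet difference used for comparison is an illustrative choice, since the abstract does not state which comparison measure the paper uses.

```python
from collections import Counter

# Aspect facets named as examples in the abstract; the paper's full facet set may differ.
ASPECTS = ["validity", "clarity", "novelty"]


def focus_distribution(labels, facets):
    """Return the share of review points assigned to each facet (shares sum to 1)."""
    counts = Counter(label for label in labels if label in facets)
    total = sum(counts.values())
    if total == 0:
        # No labeled points: return an all-zero distribution instead of dividing by zero.
        return {facet: 0.0 for facet in facets}
    return {facet: counts[facet] / total for facet in facets}


def focus_gap(llm_focus, human_focus):
    """Per-facet difference (LLM minus human); positive means the LLM over-attends."""
    return {facet: llm_focus[facet] - human_focus[facet] for facet in human_focus}


# Made-up facet labels for the weaknesses of one paper; not data from the study.
human_labels = ["validity", "novelty", "novelty", "clarity"]
llm_labels = ["validity", "validity", "validity", "clarity"]

human_focus = focus_distribution(human_labels, ASPECTS)
llm_focus = focus_distribution(llm_labels, ASPECTS)
gap = focus_gap(llm_focus, human_focus)

for facet in ASPECTS:
    print(f"{facet}: human={human_focus[facet]:.2f} llm={llm_focus[facet]:.2f} gap={gap[facet]:+.2f}")
```

In this toy example the LLM devotes more of its weaknesses to validity and none to novelty, which mirrors the direction of the bias the abstract reports. The same function would apply unchanged to the target facets (e.g., problem, method, and experiment).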

