AIMA | May 27

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

arXiv:2605.28897 | 27.8 | h-index: 6 | Has Code
AI Analysis

For the NLP and peer review community, this work highlights risks of LLM-assisted review and revision, though the findings are incremental given prior work on LLM review alignment.

The paper evaluates LLM-generated reviews for scientific papers, finding limited alignment with human reviews that varies across prompts and models, and shows that authors can 'game' LLM reviews via iterative revision, achieving statistically significant score increases for up to 35% of papers.

LLM-generated reviews for scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume not only that reviewers use LLM assistance, but also that authors use LLMs to revise their papers before submitting. In this work, we perform empirical experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author and the reviewer perspective. First, we find that LLM reviews align with human ones only to a limited degree: in the best-case scenario the alignment is reasonable, but it varies substantially across prompts and models. Second, we investigate the scenario in which the author uses an iterative draft-revise workflow to improve the submission according to the LLM review. We find that this "gaming" of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase in overall scores for up to 35% of papers. We publish our code: https://github.com/uhh-hcds/reviewarcade.
