LG
Oct 24, 2025

Gen-Review: A Large-scale Dataset of AI-Generated (and Human-written) Peer Reviews

Luca Demetrio, Giovanni Apruzzese, Kathrin Grosse, Pavel Laskov, Emil Lupu, Vera Rimmer, Philine Widmer

arXiv:2510.21192v1
9.41 citations
h-index: 5

Originality: Synthesis-oriented

AI Analysis

This dataset helps researchers and editorial boards understand how LLMs affect scientific peer reviewing. The contribution is incremental: it builds on prior work mainly by providing a larger dataset.

The authors address the lack of a comprehensive dataset for studying AI-generated peer reviews by creating GenReview, a large-scale dataset of 81K LLM-written reviews for ICLR submissions from 2018 to 2025. Their analysis of the dataset finds that LLMs exhibit bias in reviewing and that LLM-written reviews can, so far, be detected automatically.

How does the progressive adoption of Large Language Models (LLMs) affect scientific peer reviewing? This multifaceted question is fundamental to the effectiveness, as well as to the integrity, of the scientific process. Recent evidence suggests that LLMs may have already been tacitly used in peer reviewing, e.g., at the 2024 International Conference on Learning Representations (ICLR). Furthermore, various editorial boards (including that of ICLR'25) have made efforts to explicitly integrate LLMs into peer reviewing. To fully understand the utility and the implications of deploying LLMs for scientific reviewing, a comprehensive dataset is strongly desirable. Despite some previous research on this topic, such a dataset has been lacking so far. We fill this gap by presenting GenReview, the largest dataset of LLM-written reviews to date. Our dataset includes 81K reviews generated for all submissions to the 2018-2025 editions of ICLR by providing the LLM with three independent prompts: a negative, a positive, and a neutral one. GenReview is also linked to the respective papers and their original reviews, thereby enabling a broad range of investigations. To illustrate the value of GenReview, we explore a sample of intriguing research questions, namely: whether LLMs exhibit bias in reviewing (they do); whether LLM-written reviews can be automatically detected (so far, they can); whether LLMs can rigorously follow reviewing instructions (not always); and whether LLM-provided ratings align with decisions on paper acceptance or rejection (this holds true only for accepted papers). GenReview can be accessed at the following link: https://anonymous.4open.science/r/gen_review.
