CL · Aug 19, 2025

MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation

arXiv:2508.14146v4 · 5 citations · h-index: 9 · Has Code
Originality: Incremental advance
AI Analysis

MMReview addresses the problem of automating peer review for researchers and publishers by providing a standardized benchmark. The advance is incremental, since it builds on existing LLM-based review tasks.

The authors address the lack of a unified evaluation benchmark for LLM-based peer review automation, particularly for multimodal content. They propose MMReview, a benchmark spanning 240 papers across 17 domains and 13 tasks, and run experiments on 21 models (16 open-source and 5 closed-source).

With the rapid growth of academic publications, peer review has become an essential yet time-consuming responsibility within the research community. Large Language Models (LLMs) have increasingly been adopted to assist in the generation of review comments; however, current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models' ability to produce comprehensive, accurate, and human-aligned assessments, particularly in scenarios involving multimodal content such as figures and tables. To address this gap, we propose MMReview, a comprehensive benchmark that spans multiple disciplines and modalities. MMReview includes multimodal content and expert-written review comments for 240 papers across 17 research domains within four major academic disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design a total of 13 tasks grouped into four core categories, aimed at evaluating the performance of LLMs and Multimodal LLMs (MLLMs) in step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation. Extensive experiments conducted on 16 open-source models and 5 advanced closed-source models demonstrate the thoroughness of the benchmark. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.

