CL | CV | LG | Sep 21, 2025

FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

arXiv:2509.17177v2 | 1 citation | h-index: 11
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for reliable evaluation benchmarks in AI, though it appears incremental as it builds on existing evaluation frameworks.

The paper introduces ROME, a new benchmark for evaluating vision language models on reasoning from visual clues, and reports preliminary findings from a moderate-scale, largely contamination-free evaluation of current large reasoning models on automatically verifiable textual and visual questions.

We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/

