Real5-OmniDocBench: A Full-Scale Physical Reconstruction Benchmark for Robust Document Parsing in the Wild
This benchmark gives researchers and developers a way to evaluate VLM performance on physical documents, and it shows that the 'reality gap' in document parsing remains significant.
This paper introduces Real5-OmniDocBench, a benchmark that physically reconstructs all 1,355 images of OmniDocBench v1.5 across five real-world scenarios: Scanning, Warping, Screen-Photography, Illumination, and Skew. Because every physical capture maps back to its digital original, performance degradation in Vision-Language Models (VLMs) can be rigorously attributed to specific real-world factors.
While Vision-Language Models (VLMs) achieve near-perfect scores on digital document benchmarks like OmniDocBench, their performance in the unpredictable physical world remains largely unknown due to the lack of controlled yet realistic evaluations. We introduce Real5-OmniDocBench, the first benchmark that performs a full-scale, one-to-one physical reconstruction of the entire OmniDocBench v1.5 (1,355 images) across five critical real-world scenarios: Scanning, Warping, Screen-Photography, Illumination, and Skew. Unlike prior benchmarks that either lack digital correspondence or employ partial sampling, our complete ground-truth mapping enables, for the first time, rigorous factor-wise attribution of performance degradation, allowing us to pinpoint whether failures stem from geometric distortions, optical artifacts, or model limitations. Our benchmark establishes a challenging new standard for the community, demonstrating that the 'reality gap' in document parsing is far from closed, and provides a diagnostic tool to guide the development of truly resilient document intelligence.
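As a rough illustration of what factor-wise attribution could look like in practice, the sketch below pairs each physical capture with its digital original by image ID and averages the score drop per scenario. It is a minimal sketch under stated assumptions, not the paper's evaluation code: the function name `factor_wise_gap`, the image IDs, the 0-100 score scale, and all score values are hypothetical placeholders.

```python
from statistics import mean

def factor_wise_gap(digital, physical):
    """Mean score drop per scenario, relative to the digital baseline.

    digital:  {image_id: score} on the original digital pages.
    physical: {scenario: {image_id: score}} on the physical captures.
    Only image IDs present in both sets are compared, which is what a
    one-to-one mapping between captures and originals makes possible.
    """
    gaps = {}
    for scenario, scores in physical.items():
        shared = sorted(digital.keys() & scores.keys())
        if not shared:
            continue  # no paired images for this scenario
        gaps[scenario] = mean(digital[i] - scores[i] for i in shared)
    return gaps

# Hypothetical placeholder scores (0-100 scale), not results from the paper.
digital = {"img_0001": 90, "img_0002": 80}
physical = {
    "Scanning": {"img_0001": 85, "img_0002": 80},
    "Warping": {"img_0001": 60, "img_0002": 50},
}

gaps = factor_wise_gap(digital, physical)
for scenario, gap in gaps.items():
    print(f"{scenario}: mean drop {gap}")
```

Because each scenario is compared against the same digital baseline on the same pages, a larger mean drop for one scenario points at that factor rather than at differences in page content.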