CV · AI · Mar 30

MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios

Zhang Li, Zhibo Lin, Qiang Liu, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiajun Song, Jiarui Zhang, Xiang Bai, Yuliang Liu

arXiv:2603.28130 · 50.42 citations · h-index: 5 · Has Code

Predicted impact top 2% in CV · last 90 days · Originality Synthesis-oriented

AI Analysis

This work addresses performance imbalances in document parsing, a concern for researchers and practitioners working with diverse languages and real-world conditions. Its contribution is incremental: it provides a new benchmark for existing parsing methods rather than a new method.

The authors tackle the lack of a systematic benchmark for multilingual document parsing by introducing MDPBench, which comprises 3,400 document images across 17 languages, diverse scripts, and varied photographic conditions. Their evaluation finds that closed-source models are relatively robust, while open-source models degrade sharply, with an average performance drop of 17.8% on photographed documents and 14.0% on non-Latin scripts.

We introduce Multilingual Document Parsing Benchmark, the first benchmark for multilingual digital and photographed document parsing. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low-resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through a rigorous pipeline of expert model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open-source and closed-source models uncovers a striking finding: while closed-source models (notably Gemini3-Pro) prove relatively robust, open-source alternatives suffer dramatic performance collapse, particularly on non-Latin scripts and real-world photographed documents, with an average drop of 17.8% on photographed documents and 14.0% on non-Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source available at https://github.com/Yuliang-Liu/MultimodalOCR.
