CV · Feb 6

MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing

Wenjie Wang, Wei Wu, Ying Liu, Yuan Zhao, Xiaole Lv, Liang Diao, Zengjian Fan, Wenfeng Xie, Ziling Lin, De Shi, Lin Huang, Kaihe Xu

arXiv:2602.06402v1 · 1.5 · h-index: 3

Originality: Incremental advance

AI Analysis

The paper addresses reliable medical document parsing for healthcare applications, but the contribution appears incremental: it builds on existing vision-language models and adds task-specific enhancements.

The paper tackles medical document OCR, which is difficult because of complex layouts and noisy annotations. It proposes MeDocVL, a post-trained vision-language model that reports state-of-the-art performance on medical invoice benchmarks.

Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
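The abstract's "strict field-level exact matching" requirement can be illustrated with a minimal sketch. The abstract does not spell out the paper's actual evaluation protocol, so everything here is an assumption: the function name `field_exact_match`, the field names, the sample values, and the choice to strip only surrounding whitespace before comparing are all hypothetical.

```python
def field_exact_match(predicted: dict[str, str], reference: dict[str, str]) -> float:
    """Return the fraction of reference fields whose predicted value matches exactly.

    Hypothetical scoring rule: a field counts only if the predicted string
    equals the reference string after stripping surrounding whitespace.
    Missing fields count as wrong; there is no partial credit.
    """
    if not reference:
        return 0.0
    hits = sum(
        1
        for key, value in reference.items()
        if predicted.get(key, "").strip() == value.strip()
    )
    return hits / len(reference)


# Made-up invoice fields, for illustration only.
reference = {"invoice_no": "A-1029", "total": "358.40", "patient": "Li Wei"}
predicted = {"invoice_no": "A-1029", "total": "358.4", "patient": "Li Wei"}

# "358.4" is numerically equal to "358.40" but is not an exact string match,
# so only two of the three fields are counted as correct.
score = field_exact_match(predicted, reference)
```

Under a rule like this, a near-miss on a single character fails the whole field, which is one reason noisy annotations are costly for this kind of task.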
