CV · Feb 6

MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing

arXiv:2602.06402v1
h-index: 3
Originality: Incremental advance
AI Analysis

The paper addresses reliable medical document parsing for healthcare applications, but the contribution appears incremental: it builds on existing vision-language models and adds targeted enhancements.

The paper tackles medical document OCR, which is made difficult by complex layouts and noisy annotations. It proposes MeDocVL, a vision-language model that the authors report achieves state-of-the-art performance on medical invoice benchmarks.

Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
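
The abstract refers to strict field-level exact matching but does not define the metric. A minimal sketch of one common reading follows, assuming each document is parsed into flat string key-value pairs. The function name `field_exact_match` and the field names are hypothetical and are not taken from the paper.

```python
def field_exact_match(predicted: dict[str, str], reference: dict[str, str]) -> float:
    """Return the fraction of reference fields whose predicted value matches exactly.

    Assumption: a field counts as correct only if the predicted string is
    identical to the reference string (no normalization, no partial credit).
    """
    if not reference:
        return 0.0
    hits = sum(1 for key, value in reference.items() if predicted.get(key) == value)
    return hits / len(reference)


# Hypothetical invoice fields, for illustration only.
reference = {"invoice_no": "A-1029", "total": "128.50", "patient": "Li Wei"}
predicted = {"invoice_no": "A-1029", "total": "128.5", "patient": "Li Wei"}

# "128.5" differs from "128.50" as a string, so that field counts as wrong.
score = field_exact_match(predicted, reference)
```

Under this reading a near miss such as a dropped trailing zero scores the same as a completely wrong value, which is why noisy annotations are costly under this kind of evaluation.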

