DISCO: Document Intelligence Suite for COmparative Evaluation
This work provides empirical guidance for selecting document processing strategies. Its contribution is incremental, however, because it focuses on comparative evaluation rather than introducing new methods.
The paper addresses the evaluation of document intelligence systems by introducing DISCO, a suite that tests OCR pipelines and vision-language models (VLMs) on diverse document types. It finds that performance varies substantially with document characteristics: OCR pipelines do better on handwriting and long documents, while VLMs do better on multilingual text and visually rich layouts.
Document intelligence requires accurate text extraction and reliable reasoning over document content. We introduce \textbf{DISCO}, a \emph{Document Intelligence Suite for COmparative Evaluation}, that evaluates optical character recognition (OCR) pipelines and vision-language models (VLMs) separately on parsing and question answering across diverse document types, including handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents. Our evaluation shows that performance varies substantially across tasks and document characteristics, underscoring the need for complexity-aware approach selection. OCR pipelines are generally more reliable for handwriting and for long or multi-page documents, where explicit text grounding supports text-heavy reasoning, while VLMs perform better on multilingual text and visually rich layouts. Task-aware prompting yields mixed effects, improving performance on some document types while degrading it on others. These findings provide empirical guidance for selecting document processing strategies based on document structure and reasoning demands.
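The complexity-aware selection described above could be sketched as a simple routing rule. This is an illustrative sketch only, not the paper's method: the `DocumentTraits` fields, the `suggest_approach` function, and the vote-counting rule are hypothetical names and choices introduced here. The sketch encodes only the qualitative tendencies stated in the abstract (OCR for handwriting and long or multi-page documents, VLMs for multilingual text and visually rich layouts) and uses no measured scores.

```python
from dataclasses import dataclass


@dataclass
class DocumentTraits:
    """Hypothetical description of a document; field names are assumptions."""
    handwritten: bool = False
    multi_page: bool = False
    multilingual: bool = False
    visually_rich: bool = False


def suggest_approach(traits: DocumentTraits) -> str:
    """Return "ocr", "vlm", or "evaluate_both" from the document's traits.

    Assumption: each trait counts as one equal-weight vote. The abstract
    reports tendencies, not weights, so this rule is purely illustrative.
    """
    # Traits for which the abstract reports OCR pipelines as more reliable.
    ocr_votes = int(traits.handwritten) + int(traits.multi_page)
    # Traits for which the abstract reports VLMs as performing better.
    vlm_votes = int(traits.multilingual) + int(traits.visually_rich)

    if ocr_votes > vlm_votes:
        return "ocr"
    if vlm_votes > ocr_votes:
        return "vlm"
    # Mixed or absent signals: no clear preference, so compare both.
    return "evaluate_both"


if __name__ == "__main__":
    print(suggest_approach(DocumentTraits(handwritten=True, multi_page=True)))
    print(suggest_approach(DocumentTraits(multilingual=True, visually_rich=True)))
    print(suggest_approach(DocumentTraits(handwritten=True, multilingual=True)))
```

When the traits point in different directions, the sketch deliberately falls back to evaluating both approaches, which matches the abstract's observation that performance depends on the specific task and document characteristics.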