CL | AI | Aug 29, 2025

Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing

arXiv:2509.04469v13 citations | h-index: 27 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work provides insights for selecting models and strategies in automated document processing, but it is incremental as it benchmarks existing methods on new data.

The paper benchmarks eight multi-modal LLMs on invoice datasets, finding that direct image processing generally outperforms structured parsing approaches, with performance varying by model and document type.

This paper benchmarks eight multi-modal large language models from three families (GPT-5, Gemini 2.5, and open-source Gemma 3) on three diverse openly available invoice document datasets using zero-shot prompting. We compare two processing strategies: direct image processing using multi-modal capabilities and a structured parsing approach converting documents to markdown first. Results show native image processing generally outperforms structured approaches, with performance varying across model types and document characteristics. This benchmark provides insights for selecting appropriate models and processing strategies for automated document systems. Our code is available online.
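The two strategies compared in the abstract can be sketched as two ways of building a request from the same invoice. This is a minimal illustration, not the authors' code: the prompt text, the generic message dictionaries, and the `to_markdown` converter are all hypothetical placeholders, and no specific vendor API is implied.

```python
import base64
from typing import Callable, Dict, List

# Hypothetical zero-shot instruction; the paper's actual prompt is not given here.
PROMPT = "Extract the invoice fields and return them as JSON."

# Generic message shape assumed for illustration only (not a real vendor schema).
Message = Dict[str, str]


def build_image_request(image_bytes: bytes, mime_type: str = "image/png") -> List[Message]:
    """Strategy 1: pass the invoice image directly to a multi-modal model."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"type": "text", "text": PROMPT},
        {"type": "image", "mime_type": mime_type, "data": encoded},
    ]


def build_markdown_request(
    image_bytes: bytes, to_markdown: Callable[[bytes], str]
) -> List[Message]:
    """Strategy 2: convert the document to markdown first, then send text only."""
    markdown = to_markdown(image_bytes)  # converter supplied by the caller
    return [{"type": "text", "text": f"{PROMPT}\n\n{markdown}"}]
```

In a benchmark of this kind, both requests would be sent to the same model with the same prompt, so that any difference in extraction quality can be attributed to the input representation rather than to the instruction.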

