IR AI CV HC LG
Jun 19, 2025

Evaluating VisualRAG: Quantifying Cross-Modal Performance in Enterprise Document Understanding

arXiv:2506.21604v13.6
h-index: 8

Originality: Incremental advance

AI Analysis

The paper addresses the need for reliable enterprise AI deployment by providing a quantitative framework for measuring trustworthiness in multimodal RAG systems.

The paper tackles the problem of evaluating trustworthiness in multimodal generative AI for enterprise document understanding by introducing a systematic benchmarking framework for VisualRAG systems. The results show that the optimal modality weighting (30% text, 15% image, 25% caption, 30% OCR) improves performance by 57.3% over text-only baselines while maintaining computational efficiency.

Current evaluation frameworks for multimodal generative AI struggle to establish trustworthiness, hindering enterprise adoption where reliability is paramount. We introduce a systematic, quantitative benchmarking framework to measure the trustworthiness of progressively integrating cross-modal inputs such as text, images, captions, and OCR within VisualRAG systems for enterprise document intelligence. Our approach establishes quantitative relationships between technical metrics and user-centric trust measures. Evaluation reveals that optimal modality weighting with weights of 30% text, 15% image, 25% caption, and 30% OCR improves performance by 57.3% over text-only baselines while maintaining computational efficiency. We provide comparative assessments of foundation models, demonstrating their differential impact on trustworthiness in caption generation and OCR extraction, a vital consideration for reliable enterprise AI. This work advances responsible AI deployment by providing a rigorous framework for quantifying and enhancing trustworthiness in multimodal RAG for critical enterprise applications.
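
The abstract reports the modality weights but not how they are applied. As a minimal sketch, assuming the weights act as a late-fusion weighted sum over per-modality relevance scores for one retrieved document, the combination might look like the following. The names `MODALITY_WEIGHTS` and `fuse_scores`, the score range, and the renormalization over missing modalities are all illustrative assumptions, not the paper's implementation.

```python
from typing import Dict

# Weights as reported in the abstract. Applying them as a weighted sum of
# per-modality relevance scores is an assumption made for illustration.
MODALITY_WEIGHTS: Dict[str, float] = {
    "text": 0.30,
    "image": 0.15,
    "caption": 0.25,
    "ocr": 0.30,
}


def fuse_scores(
    scores: Dict[str, float],
    weights: Dict[str, float] = MODALITY_WEIGHTS,
) -> float:
    """Combine per-modality scores into one relevance score.

    Modalities absent from `scores` are skipped and the remaining weights
    are renormalized (a hypothetical design choice), so a text-only
    document is not penalized for having no image or OCR score.
    """
    present = {m: w for m, w in weights.items() if m in scores}
    total_weight = sum(present.values())
    if total_weight == 0:
        raise ValueError("scores contains no known modality")
    weighted_sum = sum(scores[m] * w for m, w in present.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical similarity scores for a single document.
    example = {"text": 0.5, "image": 0.2, "caption": 0.4, "ocr": 0.6}
    print(fuse_scores(example))
```

Under this assumption, a text-only baseline corresponds to passing only the `text` score, which makes the two configurations directly comparable on the same retrieval set.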
