Evaluating VisualRAG: Quantifying Cross-Modal Performance in Enterprise Document Understanding
This work addresses the need for reliable enterprise AI deployment by providing a quantitative framework to measure trustworthiness in multimodal RAG systems.
The paper tackles the problem of evaluating trustworthiness in multimodal generative AI for enterprise document understanding by introducing a systematic benchmarking framework for VisualRAG systems. The reported results show that the best-performing modality weighting (30% text, 15% image, 25% caption, 30% OCR) improves performance by 57.3% over text-only baselines while maintaining computational efficiency.
Current evaluation frameworks for multimodal generative AI struggle to establish trustworthiness, hindering enterprise adoption where reliability is paramount. We introduce a systematic, quantitative benchmarking framework that measures how trustworthiness changes as cross-modal inputs (text, images, captions, and OCR) are progressively integrated into VisualRAG systems for enterprise document intelligence. Our approach establishes quantitative relationships between technical metrics and user-centric trust measures. Evaluation reveals that the optimal modality weighting of 30% text, 15% image, 25% caption, and 30% OCR improves performance by 57.3% over text-only baselines while maintaining computational efficiency. We provide comparative assessments of foundation models, demonstrating their differential impact on trustworthiness in caption generation and OCR extraction, a vital consideration for reliable enterprise AI. This work advances responsible AI deployment by providing a rigorous framework for quantifying and enhancing trustworthiness in multimodal RAG for critical enterprise applications.
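To make the reported modality weighting concrete, the following is a minimal sketch of one way such weights could be applied: as a linear combination of per-modality relevance scores at retrieval time. The abstract does not state how the weights enter the pipeline, so the linear fusion, the function names `fuse_scores` and `rank`, and the score dictionaries are all illustrative assumptions; only the four weight values come from the text.

```python
from typing import Dict, List

# Weights as reported in the abstract (they sum to 1.0).
# How they are applied inside VisualRAG is NOT specified there;
# the linear fusion below is an assumption for illustration.
MODALITY_WEIGHTS: Dict[str, float] = {
    "text": 0.30,
    "image": 0.15,
    "caption": 0.25,
    "ocr": 0.30,
}


def fuse_scores(scores: Dict[str, float],
                weights: Dict[str, float] = MODALITY_WEIGHTS) -> float:
    """Weighted sum of per-modality relevance scores.

    A modality missing from `scores` contributes 0, so a text-only
    document is scored using the text weight alone.
    """
    return sum(w * scores.get(modality, 0.0) for modality, w in weights.items())


def rank(candidates: Dict[str, Dict[str, float]]) -> List[str]:
    """Return candidate ids ordered by fused score, highest first."""
    return sorted(candidates, key=lambda cid: fuse_scores(candidates[cid]),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical per-modality similarity scores in [0, 1].
    docs = {
        "text_only_doc": {"text": 0.9},
        "multimodal_doc": {"text": 0.5, "caption": 0.6, "ocr": 0.8},
    }
    for doc_id in rank(docs):
        print(doc_id, round(fuse_scores(docs[doc_id]), 3))
```

Under this assumed scheme, a document with moderate evidence across several modalities can outrank one with strong evidence in text alone, which is the kind of effect a cross-modal weighting is meant to capture.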