CVAINov 29, 2024

DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness

arXiv:2412.00151v27 citationsh-index: 24Has Code
Originality Incremental advance
AI Analysis

This work addresses the need for trustworthy and interpretable AI in high-stakes document analysis applications, though it appears incremental as it builds on existing MLLM techniques.

The paper tackles the problem of answer localization in Document Visual Question Answering by introducing DLaVA, a training-free pipeline that uses Multimodal Large Language Models for zero-shot localization, achieving competitive performance with significantly lower computational complexity and enhanced accuracy and reliability.

Document Visual Question Answering (VQA) demands robust integration of text detection, recognition, and spatial reasoning to interpret complex document layouts. In this work, we introduce DLaVA, a novel, training-free pipeline that leverages Multimodal Large Language Models (MLLMs) for zero-shot answer localization in order to improve trustworthiness, interpretability, and explainability. By leveraging an innovative OCR-free approach that organizes text regions with unique bounding box IDs, the proposed method preserves spatial contexts without relying on iterative OCR or chain-of-thought reasoning, thus substantially reducing the computational complexity. We further enhance the evaluation protocol by integrating Intersection over Union (IoU) metrics alongside Average Normalized Levenshtein Similarity (ANLS), thereby ensuring that not only textual accuracy is considered, but spatial accuracy is taken into account, ultimately reducing the risks of AI hallucinations and improving trustworthiness. Experiments on benchmark datasets demonstrate competitive performance compared to state-of-the-art techniques, with significantly lower computational complexity and enhanced accuracies and reliability for high-stakes applications. The code and datasets utilized in this study for DLaVA are accessible at: https://github.com/ahmad-shirazi/AnnotMLLM.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes