LG · CV · May 7

Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions

Kjetil Indrehus, Adrian Duric, Changkyu Choi, Ali Ramezani-Kebrya

arXiv:2605.06058

Predicted impact: top 39% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For researchers and practitioners needing transparent DocVQA systems, CoExVQA provides a verifiable reasoning process without sacrificing performance.

CoExVQA introduces a self-explainable framework for Document VQA that separates evidence identification, answer localization, and answer decoding into a chain-of-explanation, achieving a 12% ANLS improvement over explainable baselines on PFL-DocVQA.
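ANLS (Average Normalized Levenshtein Similarity) is the standard DocVQA metric: for each question, the prediction is compared with every accepted answer using normalized edit distance, similarities whose normalized distance reaches a threshold (commonly 0.5) are set to zero, and the best per-question score is averaged. The sketch below is a minimal pure-Python version of that usual definition. The lowercasing and whitespace stripping are assumptions about preprocessing, not details confirmed for the PFL-DocVQA evaluation, and the text here does not say whether the reported 12% gain is absolute or relative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best thresholded similarity.

    predictions: list of predicted answer strings
    references:  list of lists of accepted answer strings
    threshold:   normalized distances >= threshold score zero
    """
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            similarity = 1.0 - nl if nl < threshold else 0.0
            best = max(best, similarity)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

The threshold is what makes ANLS tolerant of small OCR-style slips while giving no credit to answers that are mostly wrong.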

Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant evidence and answer localization and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework with a grounded reasoning process through a chain-of-explanation design. CoExVQA first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. Prediction via CoExVQA's chain-of-explanation enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence achieves state-of-the-art explainable DocVQA performance on PFL-DocVQA, improving ANLS by 12% over the current explainable baselines while providing transparent and verifiable predictions.
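The three-stage control flow described in the abstract (evidence, then answer region, then decoding restricted to that region) can be illustrated with a toy sketch. Everything below is hypothetical: the `Token` type, the function names, and the keyword and geometry heuristics are stand-ins chosen for illustration, whereas CoExVQA uses learned components whose details are not given here. The only property the sketch aims to mirror is that the final stage reads nothing outside the localized region, and that each intermediate output is returned for inspection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical OCR token: text plus an (x0, y0, x1, y1) box."""
    text: str
    box: tuple


def _norm(word: str) -> str:
    return word.strip("?:,.").lower()


def identify_evidence(question, tokens):
    """Stage 1 (toy stand-in): tokens whose text occurs in the question."""
    q_words = {_norm(w) for w in question.split()}
    return [i for i, t in enumerate(tokens) if _norm(t.text) in q_words]


def localize_answer(evidence_idx, tokens):
    """Stage 2 (toy stand-in): the strip to the right of the evidence,
    on the same rows, up to the right edge of the page."""
    if not evidence_idx:
        return None
    boxes = [tokens[i].box for i in evidence_idx]
    page_right = max(t.box[2] for t in tokens)
    return (max(b[2] for b in boxes), min(b[1] for b in boxes),
            page_right, max(b[3] for b in boxes))


def decode_from_region(region, tokens):
    """Stage 3: build the answer only from tokens centered in the region."""
    if region is None:
        return ""
    x0, y0, x1, y1 = region
    inside = []
    for t in tokens:
        cx = (t.box[0] + t.box[2]) / 2
        cy = (t.box[1] + t.box[3]) / 2
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            inside.append(t.text)
    return " ".join(inside)


def chain_of_explanation(question, tokens):
    """Run the three stages and return every intermediate output."""
    evidence = identify_evidence(question, tokens)
    region = localize_answer(evidence, tokens)
    answer = decode_from_region(region, tokens)
    return {"evidence": evidence, "region": region, "answer": answer}


if __name__ == "__main__":
    page = [
        Token("Invoice", (0, 0, 50, 10)),
        Token("Date:", (55, 0, 90, 10)),
        Token("2024-01-05", (100, 0, 160, 10)),
        Token("Total:", (0, 20, 40, 30)),
        Token("$120.00", (100, 20, 150, 30)),
    ]
    print(chain_of_explanation("What is the invoice date?", page))
```

Because the answer is assembled solely from tokens inside the returned region, a reader can check the evidence indices and the box against the page to see why a given answer was produced, which is the kind of verification the chain-of-explanation design is meant to allow.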
