CV | AI | CL | Dec 7, 2022

Hierarchical multimodal transformers for Multi-Page DocVQA

arXiv:2212.05935v2 | 114 citations | h-index: 40
AI Analysis

This work addresses Document Visual Question Answering over multi-page documents, a more realistic setting than the single-page case, though the approach is an incremental extension of existing methods.

The authors tackle the problem of answering questions from multi-page document images by creating a new dataset (MP-DocVQA) and proposing Hi-VT5, a hierarchical transformer method based on T5 that answers questions and identifies the relevant page in a single stage.

Document Visual Question Answering (DocVQA) refers to the task of answering questions from document images. Existing work on DocVQA only considers single-page documents. However, in real scenarios documents are mostly composed of multiple pages that should be processed together. In this work we extend DocVQA to the multi-page scenario. To do so, we first create a new dataset, MP-DocVQA, where questions are posed over multi-page documents instead of single pages. Second, we propose a new hierarchical method, Hi-VT5, based on the T5 architecture, that overcomes the limitations of current methods in processing long multi-page documents. The proposed method is based on a hierarchical transformer architecture where the encoder summarizes the most relevant information of every page and the decoder then takes this summarized information to generate the final answer. Through extensive experimentation, we demonstrate that our method is able, in a single stage, to answer the questions and provide the page that contains the relevant information to find the answer, which can be used as a kind of explainability measure.
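
The hierarchical data flow described in the abstract can be illustrated with a toy sketch. This is not the Hi-VT5 implementation: the function names, the dot-product scoring, and the top-k selection are all hypothetical stand-ins for the learned per-page summarization, and the page prediction here is a simple heuristic rather than the authors' method. The sketch only shows the shape of the idea: each page is reduced to a fixed number of summary vectors, and the summaries of all pages are concatenated into one short sequence for a decoder.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def summarize_page(
    page_tokens: List[Vector], question: Vector, k: int
) -> Tuple[List[Vector], float]:
    """Toy page summary (assumption): keep the k tokens most similar to the question.

    Returns the kept tokens and a relevance score for the page.
    """
    ranked = sorted(page_tokens, key=lambda t: dot(t, question), reverse=True)
    summary = ranked[:k]
    score = sum(dot(t, question) for t in summary)
    return summary, score


def encode_document(
    pages: List[List[Vector]], question: Vector, k: int = 2
) -> Tuple[List[Vector], int]:
    """Summarize every page independently, then concatenate the summaries.

    Returns the sequence a decoder would consume and the index of the
    page with the highest toy relevance score.
    """
    if not pages:
        raise ValueError("document must contain at least one page")
    decoder_input: List[Vector] = []
    scores: List[float] = []
    for page in pages:
        summary, score = summarize_page(page, question, k)
        decoder_input.extend(summary)
        scores.append(score)
    predicted_page = max(range(len(scores)), key=scores.__getitem__)
    return decoder_input, predicted_page


# Two pages of three 2-d "token embeddings" each, and a question vector.
pages = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    [[0.0, 2.0], [0.0, 3.0], [1.0, 0.0]],
]
decoder_input, predicted_page = encode_document(pages, question=[0.0, 1.0], k=2)
```

In this sketch the decoder input has at most `len(pages) * k` vectors regardless of how many tokens each page holds, which is the property that lets a hierarchical design handle long multi-page documents.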

Code Implementations: 1 repo
