CV · Jun 22, 2021

DocFormer: End-to-End Transformer for Document Understanding

Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha

arXiv:2106.11539v2 · 36.1 · 396 citations

Originality: Highly original

AI Analysis

DocFormer addresses the challenge of understanding diverse document layouts for automated document processing. It represents a strong, specific gain rather than a foundational breakthrough.

The paper tackles Visual Document Understanding (VDU) for varied document formats such as forms and receipts. It introduces DocFormer, a multi-modal transformer that integrates text, vision, and spatial features. The model achieves state-of-the-art results on four datasets, sometimes outperforming models with four times as many parameters.

We present DocFormer -- a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in no. of parameters).
