CV · Jun 22, 2021

DocFormer: End-to-End Transformer for Document Understanding

arXiv:2106.11539v2 · 396 citations
AI Analysis

This work addresses the challenge of understanding diverse document layouts, with applications in automated document processing. It represents a strong, specific gain rather than a foundational breakthrough.

The paper tackles Visual Document Understanding (VDU) across varied document formats such as forms and receipts. It introduces DocFormer, a multi-modal transformer that integrates text, vision, and spatial features. DocFormer achieves state-of-the-art results on four datasets, sometimes outperforming models with four times as many parameters.

We present DocFormer -- a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in no. of parameters).
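The idea of sharing spatial embeddings across modalities can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`make_spatial_table`, `add_spatial`, `cross_attention`), the single position bucket per token, the unit-norm embedding table, and the plain dot-product attention are all assumptions chosen for brevity. The sketch only shows why adding the same spatial vector to a text token and a visual token at the same location makes them attend to each other.

```python
import math
import random


def make_spatial_table(num_buckets, dim, seed=0):
    """Hypothetical shared table: one unit-norm vector per position bucket."""
    rng = random.Random(seed)
    table = []
    for _ in range(num_buckets):
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in vec))
        table.append([v / norm for v in vec])
    return table


def add_spatial(features, buckets, table):
    """Add the shared spatial vector for each token's bucket to its features."""
    return [
        [f + s for f, s in zip(feat, table[b])]
        for feat, b in zip(features, buckets)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys):
    """Scaled dot-product attention weights from each query to every key."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


# Toy usage: content features are zero, so only the shared spatial
# embeddings drive the attention weights (an assumption for illustration).
dim = 8
table = make_spatial_table(num_buckets=4, dim=dim)
text_tokens = add_spatial([[0.0] * dim, [0.0] * dim], [0, 3], table)
visual_tokens = add_spatial([[0.0] * dim for _ in range(4)], [0, 1, 2, 3], table)
weights = cross_attention(text_tokens, visual_tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Because both modalities draw from the same table, each text token places its largest attention weight on the visual token in the same bucket. A real model would use learned embeddings over full bounding-box coordinates together with non-zero content features.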

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
