CL · CV · Apr 18, 2022

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Microsoft
arXiv:2204.08387v3 · 790 citations · h-index: 102
Originality: Incremental advance
AI Analysis

This paper addresses the inconsistency between text and image pre-training objectives in multimodal Document AI models and offers a general-purpose solution for both text-centric and image-centric tasks. The advance is incremental, since it builds on prior LayoutLM versions.

The paper tackles multimodal representation learning in Document AI by proposing LayoutLMv3, a model pre-trained with unified text and image masking plus a word-patch alignment objective. It achieves state-of-the-art performance on tasks such as form understanding and document image classification.

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at https://aka.ms/layoutlmv3.
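
To make the word-patch alignment objective concrete, here is a minimal sketch of how its training labels could be derived. It is an illustration under stated assumptions, not the authors' implementation. The helper names (`patch_index`, `wpa_labels`), the 224-pixel image with 16-pixel patches, the rule that a word maps to the patch containing the center of its bounding box, and the choice to skip words whose text token is itself masked are all assumptions made for this example.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical box format for this sketch: (x0, y0, x1, y1) in pixels.
Box = Tuple[float, float, float, float]


def patch_index(box: Box, image_size: int = 224, patch_size: int = 16) -> int:
    """Row-major index of the patch that contains the center of the box.

    Assumption: a word is tied to a single patch via its box center.
    """
    per_side = image_size // patch_size
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    # Clamp so a center on the far edge still lands in the last patch.
    col = min(int(cx // patch_size), per_side - 1)
    row = min(int(cy // patch_size), per_side - 1)
    return row * per_side + col


def wpa_labels(
    word_boxes: List[Box],
    text_masked: List[bool],
    masked_patches: Set[int],
    image_size: int = 224,
    patch_size: int = 16,
) -> List[Optional[int]]:
    """Binary alignment label per word.

    1    -> the word's image patch is visible (aligned)
    0    -> the word's image patch is masked (unaligned)
    None -> the word's text token is masked, so it is skipped
            (an assumption of this sketch)
    """
    labels: List[Optional[int]] = []
    for box, is_masked in zip(word_boxes, text_masked):
        if is_masked:
            labels.append(None)
            continue
        idx = patch_index(box, image_size, patch_size)
        labels.append(0 if idx in masked_patches else 1)
    return labels


# Toy inputs: three words, one masked image patch, one masked text token.
boxes = [(0, 0, 16, 16), (20, 4, 40, 12), (100, 100, 120, 110)]
labels = wpa_labels(boxes, text_masked=[False, False, True], masked_patches={1})
print(labels)
```

In this toy input the second word falls in the masked patch and the third word has a masked text token. A binary classifier on top of each word's contextual representation would then be trained against the non-`None` labels.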

Code Implementations: 4 repos
