CV · Mar 4, 2022

DiT: Self-supervised Pre-training for Document Image Transformer

Microsoft
arXiv:2203.02378v3 · 237 citations · h-index: 102
Originality: Incremental advance
AI Analysis

This work addresses the lack of human-labeled document images for supervised learning by pre-training without labels, enabling better performance in vision-based Document AI tasks like classification and layout analysis.

The paper tackles document image understanding by proposing DiT, a self-supervised pre-trained transformer model for Document AI tasks. It reports state-of-the-art results, for example improving document image classification from 91.11 to 92.69 and document layout analysis from 91.0 to 94.9.

Image Transformer has recently achieved significant progress for natural image understanding, either using supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model using large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts ever exist due to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection as well as text detection for OCR. Experiment results have illustrated that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g. document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9), table detection (94.23 → 96.55) and text detection for OCR (93.07 → 94.29). The code and pre-trained models are publicly available at https://aka.ms/msdit.
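To make the reported improvements easier to compare, the short sketch below computes the absolute gain for each of the four tasks. The scores are copied from the abstract; the names `RESULTS` and `absolute_gains` are hypothetical and exist only for this illustration. The abstract does not name the metric behind each number, so the gains are simply differences in whatever score each benchmark reports.

```python
# Scores copied from the abstract as (previous best, DiT).
# The abstract does not state the metric for each task, so none is assumed here.
RESULTS = {
    "document image classification": (91.11, 92.69),
    "document layout analysis": (91.0, 94.9),
    "table detection": (94.23, 96.55),
    "text detection for OCR": (93.07, 94.29),
}


def absolute_gains(results):
    """Return the DiT-minus-previous-best difference per task, rounded to 2 decimals."""
    return {task: round(new - old, 2) for task, (old, new) in results.items()}


if __name__ == "__main__":
    for task, gain in absolute_gains(RESULTS).items():
        print(f"{task}: +{gain:.2f}")
```

By this simple measure the largest reported gain is on document layout analysis and the smallest is on text detection for OCR.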

Code Implementations: 4 repos
