CL CV · Mar 25, 2024

Visually Guided Generative Text-Layout Pre-training for Document Intelligence

Zhiming Mao, Haoli Bai, Lu Hou, Jiansheng Wei, Xin Jiang, Qun Liu, Kam-Fai Wong

arXiv:2403.16516v2 · 17.736 citations · h-index: 19 · Has Code · NAACL

Originality: Incremental advance

AI Analysis

This work addresses visual document understanding, covering both OCR and downstream tasks. The contribution is incremental: it builds on existing pre-training techniques and adds a multi-segment scheme for long documents.

The authors propose ViTLP, a pre-training method that generates interleaved text and layout sequences from document images. It achieves competitive performance on benchmark tasks including information extraction, document classification, and document question answering.

Prior studies show that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to perceive and reason over both document texts and layouts (e.g., locations of texts and table cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. In addition, to address the limitation of processing long documents by Transformers, we introduce a straightforward yet effective multi-segment generative pre-training scheme, enabling ViTLP to process word-intensive documents of any length. ViTLP can function as a native OCR model to localize and recognize texts in document images. ViTLP can also be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance over existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering.
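To make the two ideas in the abstract concrete, the following is a minimal Python sketch of (1) building an interleaved text and layout sequence from OCR words and (2) splitting a long sequence into segments that each carry a short prefix of preceding tokens. It is illustrative only and is not the authors' implementation. The token format (`<x…>`, `<y…>`), the 1000-bin coordinate quantization, and the helper names `quantize`, `interleave`, and `split_segments` are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical input format: (word, (x0, y0, x1, y1)) with pixel coordinates.
Word = Tuple[str, Tuple[int, int, int, int]]


def quantize(v: int, size: int, bins: int = 1000) -> int:
    """Map a pixel coordinate to an integer bin in [0, bins - 1] (assumed scheme)."""
    return min(bins - 1, max(0, v * bins // size))


def interleave(words: List[Word], width: int, height: int, bins: int = 1000) -> List[str]:
    """Emit each word followed by four layout tokens for its bounding box."""
    seq: List[str] = []
    for text, (x0, y0, x1, y1) in words:
        seq.append(text)
        seq.append(f"<x{quantize(x0, width, bins)}>")
        seq.append(f"<y{quantize(y0, height, bins)}>")
        seq.append(f"<x{quantize(x1, width, bins)}>")
        seq.append(f"<y{quantize(y1, height, bins)}>")
    return seq


def split_segments(
    seq: List[str], seg_len: int, prefix_len: int
) -> List[Tuple[List[str], List[str]]]:
    """Split seq into (prefix, target) pairs.

    Each segment holds at most seg_len tokens in total. Segments after the
    first start with up to prefix_len tokens copied from the end of the
    preceding text, used as context rather than as new targets.
    """
    if not 0 <= prefix_len < seg_len:
        raise ValueError("prefix_len must satisfy 0 <= prefix_len < seg_len")
    segments = []
    start = 0
    while start < len(seq):
        prefix = seq[max(0, start - prefix_len):start]
        target = seq[start:start + seg_len - len(prefix)]
        segments.append((prefix, target))
        start += len(target)
    return segments


# Toy example: two words on a 200 x 100 pixel page.
words = [("Invoice", (0, 0, 100, 20)), ("Total", (0, 50, 50, 70))]
seq = interleave(words, width=200, height=100)
segs = split_segments(seq, seg_len=6, prefix_len=2)
```

In this sketch, concatenating the targets of all segments reproduces the full sequence, so a document of any length can be covered by a fixed-size context window. How the actual model conditions on earlier segments and the document image is not specified by the abstract.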
