CL | Oct 16, 2021

MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding

arXiv:2110.08518v2 | 76 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of processing interactive digital documents for applications like web content analysis. It is an incremental advance because it builds on existing multimodal pre-training approaches.

The authors tackled the problem of understanding digital documents with dynamic layouts, such as HTML/XML, by proposing MarkupLM, a pre-training method that jointly learns text and markup information. The model significantly outperformed existing baselines on several document understanding tasks.

Multimodal pre-training with text, layout, and image has made significant progress for Visually Rich Document Understanding (VRDU), especially for fixed-layout documents such as scanned document images. However, there are still a large number of digital documents where the layout information is not fixed and needs to be interactively and dynamically rendered for visualization, which makes existing layout-based pre-training approaches difficult to apply. In this paper, we propose MarkupLM for document understanding tasks with markup languages as the backbone, such as HTML/XML-based documents, where text and markup information is jointly pre-trained. Experiment results show that the pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding tasks. The pre-trained model and code will be publicly available at https://aka.ms/markuplm.

Code Implementations: 2 repos