CV | CL | MM | Jun 27, 2022

Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding

arXiv:2206.13155v2 | 25 citations | h-index: 62
Originality: Incremental advance
AI Analysis

This work addresses the need for better generalization and accuracy in multi-modal document understanding. The advance is incremental: it builds on existing document pre-trained models by improving how they model vision-language interactions.

The paper tackles vision-language joint representation learning for visually-rich document understanding by proposing Bi-VLDoc, a pre-training paradigm that combines bidirectional vision-language supervision with a hybrid-attention mechanism. It reports state-of-the-art results on three benchmarks: Form Understanding (from 85.14% to 93.44%), Receipt Information Extraction (from 96.01% to 97.84%), and Document Classification (from 96.08% to 97.12%).

Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Though existing document pre-trained models have achieved excellent performance on standard benchmarks for VrDU, the way they model and exploit the interactions between vision and language on documents has hindered them from better generalization ability and higher accuracy. In this work, we investigate the problem of vision-language joint representation learning for VrDU mainly from the perspective of supervisory signals. Specifically, a pre-training paradigm called Bi-VLDoc is proposed, in which a bidirectional vision-language supervision strategy and a vision-language hybrid-attention mechanism are devised to fully explore and utilize the interactions between these two modalities, to learn stronger cross-modal document representations with richer semantics. Benefiting from the learned informative cross-modal document representations, Bi-VLDoc significantly advances the state-of-the-art performance on three widely-used document understanding benchmarks, including Form Understanding (from 85.14% to 93.44%), Receipt Information Extraction (from 96.01% to 97.84%), and Document Classification (from 96.08% to 97.12%). On Document Visual QA, Bi-VLDoc achieves the state-of-the-art performance compared to previous single model methods.
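To put the three reported improvements on a common footing, the arithmetic can be sketched as below. This is only an illustrative calculation on the figures quoted in the abstract. The helper name `gains` and the "share of the remaining gap closed" measure are assumptions introduced here, not quantities the paper reports, and the measure presumes each benchmark score is bounded by 100%.

```python
# Scores (in %) quoted in the abstract: (previous state of the art, Bi-VLDoc).
reported = {
    "Form Understanding": (85.14, 93.44),
    "Receipt Information Extraction": (96.01, 97.84),
    "Document Classification": (96.08, 97.12),
}


def gains(prev, new):
    """Hypothetical helper: absolute gain in points and share of the gap to 100% closed."""
    absolute = new - prev
    # Assumption: the metric tops out at 100%, so (100 - prev) is the remaining gap.
    gap_closed = absolute / (100.0 - prev)
    return round(absolute, 2), round(gap_closed, 3)


summary = {task: gains(prev, new) for task, (prev, new) in reported.items()}

for task, (absolute, gap_closed) in summary.items():
    print(f"{task}: +{absolute:.2f} points, {gap_closed:.1%} of the remaining gap closed")
```

Read this way, the Form Understanding gain (+8.30 points) is by far the largest in absolute terms, while the Receipt Information Extraction and Document Classification gains (+1.83 and +1.04 points) start from baselines already above 96%.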
