CV AI Oct 23, 2023

Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning

Hao Wang, Xiahua Chen, Rui Wang, Chenhui Chu

arXiv:2310.14785v133.2132 citations h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses entity recognition in visually-rich documents, which matters for applications such as form processing. It appears incremental because it builds on existing multimodal methods and adds targeted enhancements.

The paper tackles semantic entity extraction from visually-rich document images by improving the model's ability to capture fine-grained visual and layout features, and it reports substantial performance improvements over strong baselines such as the LayoutLM series on benchmark datasets.

Extracting meaningful entities belonging to predefined categories from Visually-rich Form-like Documents (VFDs) is a challenging task. Visual and layout features such as font, background, color, and bounding box location and size provide important cues for identifying entities of the same type. However, existing models commonly train a visual encoder with weak cross-modal supervision signals, resulting in a limited capacity to capture these non-textual features and suboptimal performance. In this paper, we propose a novel Visually-Asymmetric coNsistenCy Learning (VANCL) approach that addresses the above limitation by enhancing the model's ability to capture fine-grained visual and layout features through the incorporation of color priors. Experimental results on benchmark datasets show that our approach substantially outperforms the strong LayoutLM series baseline, demonstrating the effectiveness of our approach. Additionally, we investigate the effects of different color schemes on our approach, providing insights for optimizing model performance. We believe our work will inspire future research on multimodal information extraction.
