CV · LG · Oct 5, 2020

VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach

arXiv:2010.02358v5 · 27 citations
Originality: Incremental advance
AI Analysis

This work addresses information extraction from scanned documents. It is an incremental advance: it improves upon the existing Chargrid and Wordgrid models rather than introducing a new problem setting.

The authors tackle field extraction from scanned documents with a multimodal approach that encodes textual, visual, and layout information into a 3-axis tensor, reporting higher performance than recent state-of-the-art methods on public and private datasets.

We introduce a novel approach for scanned document representation to perform field extraction. It allows the simultaneous encoding of the textual, visual and layout information in a 3-axis tensor used as an input to a segmentation model. We improve the recent Chargrid and Wordgrid \cite{chargrid} models in several ways: first by taking into account the visual modality, then by boosting robustness on small datasets while keeping the inference time low. Our approach is tested on public and private document-image datasets, showing higher performance than recent state-of-the-art methods.
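To make the 3-axis tensor idea concrete, here is a minimal sketch of one way such an input could be assembled. It is an illustration under stated assumptions, not the authors' implementation: the function names `toy_embedding` and `build_grid` are hypothetical, the embedding is a deterministic pseudo-random stand-in for a real word embedding, and combining the modalities by channel-wise concatenation is an assumption made for this example. Layout is encoded implicitly by writing each word's vector into the pixels of its bounding box.

```python
import numpy as np


def toy_embedding(word: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a real word embedding.

    Returns a deterministic pseudo-random vector derived from the characters.
    """
    seed = sum(ord(c) * (i + 1) for i, c in enumerate(word)) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim).astype(np.float32)


def build_grid(image: np.ndarray, words, dim: int = 8) -> np.ndarray:
    """Assemble a (H, W, dim + 3) tensor from an image and OCR-style words.

    image: (H, W, 3) uint8 array.
    words: list of (text, (x0, y0, x1, y1)) with pixel-coordinate boxes.
    """
    h, w, _ = image.shape
    # Textual + layout channels: zero background, embedding inside each box.
    text_grid = np.zeros((h, w, dim), dtype=np.float32)
    for text, (x0, y0, x1, y1) in words:
        text_grid[y0:y1, x0:x1, :] = toy_embedding(text, dim)
    # Visual channels: the image itself, scaled to [0, 1].
    visual = image.astype(np.float32) / 255.0
    # Assumption for this sketch: stack both modalities along the channel axis.
    return np.concatenate([text_grid, visual], axis=-1)


# Usage: a blank white "page" with two made-up words.
image = np.full((32, 64, 3), 255, dtype=np.uint8)
words = [("Invoice", (2, 2, 30, 10)), ("Total", (2, 20, 20, 28))]
grid = build_grid(image, words)
print(grid.shape)  # (32, 64, 11)
```

A tensor of this shape can be passed to any segmentation network that accepts an arbitrary number of input channels, which then predicts a field label per pixel.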

