CV LG · Oct 5, 2020

VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach

Mohamed Kerroumi, Othmane Sayem, Aymen Shabou

arXiv:2010.02358v59.127 citations

Originality: Incremental advance

AI Analysis

This work addresses information extraction from scanned documents. The advance is incremental: it improves on the existing Chargrid and Wordgrid models rather than introducing a new task or paradigm.

The authors tackle field extraction from scanned documents with a multimodal approach that encodes textual, visual, and layout information into a single 3-axis tensor. They report higher performance than recent state-of-the-art methods on both public and private datasets.

We introduce a novel approach for scanned document representation to perform field extraction. It allows the simultaneous encoding of the textual, visual and layout information in a 3-axis tensor used as an input to a segmentation model. We improve the recent Chargrid and Wordgrid models in several ways: first by taking into account the visual modality, then by boosting their robustness with regard to small datasets while keeping the inference time low. Our approach is tested on public and private document-image datasets, showing higher performance than recent state-of-the-art methods.
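
To make the idea of a 3-axis input tensor concrete, here is a minimal sketch of one way such a representation could be assembled. It is an illustration under stated assumptions, not the authors' implementation: the channel-wise concatenation of a word grid with the RGB image is only one plausible reading of the abstract, and `toy_embedding` and `build_grid` are hypothetical names, with a hash-based vector standing in for a learned word embedding.

```python
import hashlib

import numpy as np


def toy_embedding(word: str, dim: int = 4) -> np.ndarray:
    """Hypothetical stand-in for a learned word embedding.

    Maps a word to a deterministic vector in [0, 1]^dim using a hash.
    """
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return np.frombuffer(digest[:dim], dtype=np.uint8).astype(np.float32) / 255.0


def build_grid(image: np.ndarray, words, dim: int = 4) -> np.ndarray:
    """Assemble an (H, W, dim + 3) tensor from an image and OCR output.

    image: (H, W, 3) array of RGB values in [0, 1] (visual modality).
    words: list of (text, (x0, y0, x1, y1)) pixel boxes, with x1 and y1
           exclusive (textual modality plus layout via box positions).
    Assumption: the modalities are fused by channel-wise concatenation.
    """
    height, width, _ = image.shape
    text_grid = np.zeros((height, width, dim), dtype=np.float32)
    for token, (x0, y0, x1, y1) in words:
        # Every pixel inside the word's box carries that word's embedding.
        text_grid[y0:y1, x0:x1, :] = toy_embedding(token, dim)
    # Axes 0 and 1 encode layout; the last axis stacks text and RGB channels.
    return np.concatenate([text_grid, image.astype(np.float32)], axis=-1)


if __name__ == "__main__":
    page = np.full((8, 10, 3), 0.5)             # uniform grey "scan"
    ocr_words = [("Total", (1, 2, 5, 4))]       # one word with its box
    print(build_grid(page, ocr_words).shape)
```

In this sketch the first two axes carry the layout, and the last axis carries the text embedding channels followed by the three colour channels. A tensor of this shape could then be passed to a segmentation network that labels each pixel with a field class.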
