CV AI LG Nov 7, 2022

On Web-based Visual Corpus Construction for Visual Document Understanding

Donghyun Kim, Teakgyu Hong, Moonbin Yim, Yoonsik Kim, Geewook Kim

arXiv:2211.03256v2 8.16 citations h-index: 7 Has Code

Originality: Incremental advance

AI Analysis

This work addresses a data bottleneck for researchers in visual document understanding, particularly those working with non-Latin languages. Its contribution is incremental, since it builds on existing web-based data collection methods.

The paper tackles the limited availability of visual corpora for visual document understanding by proposing Webvicob, a dataset generator that constructs multilingual corpora from Wikipedia HTML dumps. Using 1 million Webvicob-generated images yields an improvement of over 13% on DocVQA Task 3 compared to using 11 million images from IIT-CDIP.

In recent years, research on visual document understanding (VDU) has grown significantly, with a particular emphasis on the development of self-supervised learning methods. However, one of the significant challenges faced in this field is the limited availability of publicly accessible visual corpora, that is, extensive collections of images with detailed text annotations, particularly for non-Latin or resource-scarce languages. To address this challenge, we propose Web-based Visual Corpus Builder (Webvicob), a dataset generator engine capable of constructing large-scale, multilingual visual corpora from raw Wikipedia HTML dumps. Our experiments demonstrate that the data generated by Webvicob can be used to train robust VDU models that perform well on various downstream tasks, such as DocVQA and post-OCR parsing. Furthermore, when using a dataset of 1 million images generated by Webvicob, we observed an improvement of over 13% on DocVQA Task 3 compared to a dataset of 11 million images from IIT-CDIP. The implementation of our engine is publicly available at https://github.com/clovaai/webvicob.
