CL CV

May 9, 2023

WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset

Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo

arXiv:2305.05432v13.311 citations

h-index: 75

Originality: Synthesis-oriented

AI Analysis

This dataset addresses a gap for researchers in multimodal AI by providing structured resources for webpage tasks, though it is incremental as it builds on existing Wikipedia data.

The authors tackled the lack of a comprehensive multimodal dataset for webpage understanding by introducing WikiWeb2M, which retains full page-level images, text, and structure from Wikipedia, enabling tasks like page description generation and contextual image captioning.

Webpages have been a rich resource for language and vision-language tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention, and structured image-text data has been underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage 2M (WikiWeb2M) suite, the first to retain the full set of images, text, and structure data available in a page. WikiWeb2M can be used for tasks like page description generation, section summarization, and contextual image captioning.
