CL · CV · May 9, 2023

WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset

arXiv:2305.05432v1 · 11 citations
Originality: Synthesis-oriented
AI Analysis

This dataset fills a gap for multimodal AI researchers by providing structured resources for webpage tasks, though the contribution is incremental because it builds on existing Wikipedia data.

The authors tackled the lack of a comprehensive multimodal dataset for webpage understanding by introducing WikiWeb2M, which retains full page-level images, text, and structure from Wikipedia, enabling tasks like page description generation and contextual image captioning.

Webpages have been a rich resource for language and vision-language tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention and structured image-text data has been underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage 2M (WikiWeb2M) suite, the first to retain the full set of images, text, and structure data available in a page. WikiWeb2M can be used for tasks like page description generation, section summarization, and contextual image captioning.

Code Implementations: 2 repos
