AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation
AToMiC provides a more realistic testbed for researchers in cross-modal retrieval, though the contribution is incremental, since it builds on existing vision-language pretrained models.
The paper addresses the oversimplification of existing image-text retrieval datasets by introducing the AToMiC dataset, which leverages the hierarchical structures and diverse multimedia content of Wikipedia to support realistic multimedia content creation tasks. The dataset is validated through baseline retrieval experiments.
This paper presents the AToMiC (Authoring Tools for Multimedia Content) dataset, designed to advance research in image/text cross-modal retrieval. While vision-language pretrained transformers have led to significant improvements in retrieval effectiveness, existing research has relied on image-caption datasets that feature only simplistic image-text relationships and underspecified user models of retrieval tasks. To address the gap between these oversimplified settings and real-world applications for multimedia content creation, we introduce a new approach for building retrieval test collections. We leverage hierarchical structures and diverse domains of texts, styles, and types of images, as well as large-scale image-document associations embedded in Wikipedia. We formulate two tasks based on a realistic user model and validate our dataset through retrieval experiments using baseline models. AToMiC offers a testbed for scalable, diverse, and reproducible multimedia retrieval research. Finally, the dataset provides the basis for a dedicated track at the 2023 Text Retrieval Conference (TREC), and is publicly available at https://github.com/TREC-AToMiC/AToMiC.