CL · Apr 30, 2020

Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO

arXiv:2004.15020v3 · 817 citations
AI Analysis

This dataset extension fills a gap in multi-modal retrieval research by supplying similarity judgments that MS-COCO lacks. It benefits researchers in representation learning, though the contribution is incremental because it builds on an existing dataset.

The authors address the limited cross-modal associations in image captioning datasets by introducing Crisscrossed Captions (CxC), an extension of MS-COCO with human semantic similarity judgments for 267,095 intra- and inter-modality pairs. They report baseline results that show its utility for evaluating multimodal models.

By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress on representation learning. Unfortunately, datasets have limited cross-modal associations: images are not paired with other images, captions are only paired with other captions of the same image, there are no negative associations and there are missing positive cross-modal associations. This undermines research into how inter-modality learning impacts intra-modality tasks. We address this gap with Crisscrossed Captions (CxC), an extension of the MS-COCO dataset with human semantic similarity judgments for 267,095 intra- and inter-modality pairs. We report baseline results on CxC for strong existing unimodal and multimodal models. We also evaluate a multitask dual encoder trained on both image-caption and caption-caption pairs that crucially demonstrates CxC's value for measuring the influence of intra- and inter-modality learning.
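The association gaps described in the abstract can be illustrated with a minimal sketch. The toy data, the `coco` dictionary, and both helper functions are hypothetical and are not taken from the paper or its released code; the sketch only counts which pairs an MS-COCO-style layout provides against all the pairs that could in principle be judged.

```python
from itertools import combinations

# Hypothetical toy data in an MS-COCO-like layout: each image id maps to
# its own captions only. Ids and captions are invented for illustration.
coco = {
    "img_a": ["a dog runs on a beach", "a puppy playing in sand"],
    "img_b": ["a dog on the shore", "a brown dog near the sea"],
}


def original_pairs(dataset):
    """Enumerate the associations the original layout provides."""
    # Each image is paired only with its own captions.
    image_caption = [(img, cap) for img, caps in dataset.items() for cap in caps]
    # Captions are paired only with co-captions of the same image.
    caption_caption = [
        pair for caps in dataset.values() for pair in combinations(caps, 2)
    ]
    # Images are never paired with other images.
    image_image = []
    return image_caption, caption_caption, image_image


def all_candidate_pairs(dataset):
    """Enumerate every pair that could receive a similarity judgment."""
    images = list(dataset)
    captions = [cap for caps in dataset.values() for cap in caps]
    image_caption = [(img, cap) for img in images for cap in captions]
    caption_caption = list(combinations(captions, 2))
    image_image = list(combinations(images, 2))
    return image_caption, caption_caption, image_image


ic, cc, ii = original_pairs(coco)
all_ic, all_cc, all_ii = all_candidate_pairs(coco)

for name, have, possible in [
    ("image-caption", ic, all_ic),
    ("caption-caption", cc, all_cc),
    ("image-image", ii, all_ii),
]:
    print(f"{name}: {len(have)} of {len(possible)} candidate pairs are associated")
```

In this toy setting the two beach images and their captions are plausibly similar to each other, yet the original layout records none of those cross-image associations and offers no graded or negative labels. That is the kind of gap the human similarity judgments in CxC are meant to fill.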

Code Implementations: 2 repos