CV, Dec 8, 2020

StacMR: Scene-Text Aware Cross-Modal Retrieval

Andrés Mafla, Rafael Sampaio de Rezende, Lluís Gómez, Diane Larlus, Dimosthenis Karatzas

arXiv:2012.04329v1, 7.219 citations

Originality: Incremental advance

AI Analysis

This work is significant for researchers in cross-modal retrieval, as it highlights and addresses the previously overlooked importance of scene text for improving retrieval performance.

This paper addresses the neglect of scene text in cross-modal retrieval by proposing a new dataset of images that contain scene-text instances. The authors develop a scene-text aware cross-modal retrieval method that builds specialized representations for caption text and scene text, then unifies them in a shared embedding space.

Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions, to name a few. This has resulted in improved matching between the visual representation of an image and the textual representation of its caption. Yet current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches that leverage scene text, including a better scene-text aware cross-modal retrieval method that uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight interesting research questions worth exploring further. Dataset and code are available at http://europe.naverlabs.com/stacmr
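
To make the idea of a common embedding space concrete, here is a minimal toy sketch. It is not the authors' model: the vectors are hand-picked 3-dimensional stand-ins for learned embeddings, the weighted-sum fusion and its `alpha` parameter are assumptions chosen for simplicity, and the function names (`fuse_image`, `rank_images`) are hypothetical. It only illustrates how scene text can break a tie between two images whose visual content looks the same.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_image(visual_vec, scene_text_vec, alpha=0.5):
    """Assumed fusion: weighted sum of unit vectors, then renormalize.

    alpha = 0 ignores scene text; alpha = 1 ignores the visual features.
    """
    v = l2_normalize(visual_vec)
    s = l2_normalize(scene_text_vec)
    return l2_normalize([(1 - alpha) * a + alpha * b for a, b in zip(v, s)])


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def rank_images(caption_vec, image_vecs):
    """Return image indices sorted from most to least similar to the caption."""
    scores = [cosine(caption_vec, img) for img in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)


# Two images with identical visual features but different scene text.
visual = [1.0, 0.0, 0.0]
scene_text = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A caption embedding that matches the scene text of the second image.
caption = [1.0, 0.0, 1.0]

visual_only = [fuse_image(visual, st, alpha=0.0) for st in scene_text]
with_scene_text = [fuse_image(visual, st, alpha=0.5) for st in scene_text]

ranking_visual_only = rank_images(caption, visual_only)
ranking_with_scene_text = rank_images(caption, with_scene_text)
```

With `alpha=0.0` both images receive the same score, so the ranking cannot separate them; with scene text fused in, the second image ranks first. In a real system the embeddings would come from trained encoders and the fusion would be learned rather than fixed.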
