CL CV Nov 9, 2022

Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions

Michele Cafagna, Kees van Deemter, Albert Gatt

arXiv:2211.04971v223.9290 citations h-index: 12

Originality: Synthesis-oriented

AI Analysis

This work addresses the limitation of object-centric image captioning for applications requiring scene understanding, though it is incremental as it builds on existing models and datasets.

The paper tackles the problem of generating scene-level descriptions from images, showing that fine-tuning a state-of-the-art Vision and Language model (VinVL) on a small curated dataset enables it to produce holistic scene descriptions while retaining its ability to identify object-level concepts.

Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of the fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.
