CV CL Jun 27, 2025

COOCO -- Common Objects Out-of-Context -- Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication

Filippo Merlo, Ece Takmaz, Wenkai Chen, Albert Gatt

arXiv:2506.22274v16.21 citations, h-index: 7, Has Code

Originality Synthesis-oriented

AI Analysis

This work addresses how VLMs handle semantic violations in scenes, which is important for improving multimodal AI systems. The contribution is incremental: it builds on existing VLM research with a new dataset and analysis.

The paper asks whether Vision-Language Models (VLMs) rely on scene context when referring to objects. It introduces the COOCO dataset and finds that models use context adaptively, depending on semantic congruence and noise level. Attention analysis shows increased focus on the target in mid-level layers under moderate noise.

Natural scenes provide us with rich contexts for object recognition and reference. In particular, knowing what type of scene one is looking at generates expectations about which objects will occur, and what their spatial configuration should be. Do Vision-Language Models (VLMs) learn to rely on scene contexts in a similar way, when generating references to objects? To address this question, we introduce the Common Objects Out-of-Context (COOCO) dataset and test to what extent VLMs rely on scene context to refer to objects under different degrees of scene-object congruency, and different perturbations. Our findings show that models leverage scene context adaptively, depending on both the semantic relatedness between object and scene and the level of noise. In particular, models rely more on context under high target-scene congruence or when objects are degraded. Attention analysis reveals that successful object categorisation involves increased focus on the target in mid-level layers, especially under moderate noise, suggesting that VLMs dynamically balance local and contextual information for reference generation. We make our dataset, code and models available at https://github.com/cs-nlp-uu/scenereg.
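The "focus on the target" mentioned in the abstract can be pictured as the share of a layer's attention mass that falls inside the target's region. The sketch below is only an illustration under assumptions: the function name `target_attention_share`, the per-layer grid of patch weights, and the box convention are hypothetical and are not taken from the paper or its repository.

```python
def target_attention_share(attn, box):
    """Fraction of attention mass inside the target box, per layer.

    attn: list of layers, each a 2D list (rows x cols) of non-negative
          attention weights over image patches (assumed layout).
    box:  (r0, r1, c0, c1), half-open patch ranges covering the target
          (assumed convention).
    """
    r0, r1, c0, c1 = box
    shares = []
    for layer in attn:
        total = sum(sum(row) for row in layer)
        inside = sum(sum(row[c0:c1]) for row in layer[r0:r1])
        # A layer with no attention mass contributes a share of 0.0.
        shares.append(inside / total if total > 0 else 0.0)
    return shares


# Toy input: one layer with uniform attention, one focused on a single patch.
uniform = [[1.0] * 4 for _ in range(4)]
focused = [[0.0] * 4 for _ in range(4)]
focused[1][1] = 1.0

print(target_attention_share([uniform, focused], (0, 2, 0, 2)))  # [0.25, 1.0]
```

Comparing such per-layer shares across noise levels would be one way to look for the kind of mid-layer increase the abstract describes; the paper's actual measure may differ.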
