CV · CL · Jun 27, 2025

COOCO -- Common Objects Out-of-Context -- Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication

arXiv:2506.22274v1 · 1 citation · h-index: 7 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses how VLMs handle semantic violations in scenes, which matters for improving multimodal AI systems. It is incremental, however: it builds on existing VLM research and contributes a new dataset and analysis.

The paper asks whether Vision-Language Models (VLMs) rely on scene context when referring to objects, and introduces the COOCO dataset to test this. It finds that models use context adaptively, depending on semantic congruence and noise level. Attention analysis shows increased focus on the target in mid-level layers under moderate noise.

Natural scenes provide us with rich contexts for object recognition and reference. In particular, knowing what type of scene one is looking at generates expectations about which objects will occur, and what their spatial configuration should be. Do Vision-Language Models (VLMs) learn to rely on scene contexts in a similar way, when generating references to objects? To address this question, we introduce the Common Objects Out-of-Context (COOCO) dataset and test to what extent VLMs rely on scene context to refer to objects under different degrees of scene-object congruency, and different perturbations. Our findings show that models leverage scene context adaptively, depending on both the semantic relatedness between object and scene and the level of noise. In particular, models rely more on context under high target-scene congruence or when objects are degraded. Attention analysis reveals that successful object categorisation involves increased focus on the target in mid-level layers, especially under moderate noise, suggesting that VLMs dynamically balance local and contextual information for reference generation. We make our dataset, code and models available at https://github.com/cs-nlp-uu/scenereg.

Code Implementations: 1 repo
