CV | AI | Mar 29, 2025

Evaluating Compositional Scene Understanding in Multimodal Generative Models

Shuhao Fu, Andrew Jun Lee, Anna Wang, Ida Momennejad, Trevor Bihl, Hongjing Lu, Taylor W. Webb

arXiv:2503.23125v113.16 citations | h-index: 7 | Has Code | Trans. Mach. Learn. Res.

Originality: Synthesis-oriented

AI Analysis

This work assesses compositional scene understanding in multimodal AI systems, finding incremental progress over earlier models but significant gaps relative to human performance.

The paper evaluated the compositional scene understanding capabilities of current multimodal generative models, finding that while they show improvement over previous models, their performance remains significantly below human levels, especially for complex scenes with more than 5 objects and multiple relations.

The visual world is fundamentally compositional. Visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit this compositionality to achieve robust and generalizable scene understanding. While major strides have been made toward the development of general-purpose, multimodal generative models, including both text-to-image models and multimodal vision-language models, it remains unclear whether these systems are capable of accurately generating and interpreting scenes involving the composition of multiple objects and relations. In this work, we present an evaluation of the compositional visual processing capabilities in the current generation of text-to-image (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants. The results suggest that these systems display some ability to solve compositional and relational tasks, showing notable improvements over the previous generation of multimodal models, but with performance nevertheless well below the level of human participants, particularly for more complex scenes involving many ($>5$) objects and multiple relations. These results highlight the need for further progress toward compositional understanding of visual scenes.
