CIVET: Systematic Evaluation of Understanding in VLMs
CIVET addresses the lack of standardized evaluation for VLMs by giving researchers a tool to assess model understanding in a controlled manner. The contribution is incremental in that it targets evaluation rather than new model development.
The paper addresses the problem of evaluating how well Vision-Language Models (VLMs) understand object properties and relations. It introduces CIVET, a framework for systematic evaluation, and finds that current VLMs accurately recognize only a limited set of basic object properties, depend heavily on object position, struggle with basic relations, and fall short of human-level accuracy.
While Vision-Language Models (VLMs) have achieved competitive performance in various tasks, their comprehension of the underlying structure and semantics of a scene remains understudied. To investigate the understanding of VLMs, we study their capability regarding object properties and relations in a controlled and interpretable manner. To this end, we introduce CIVET, a novel and extensible framework for systematiC evaluatIon Via controllEd sTimuli. CIVET addresses the lack of standardized systematic evaluation for assessing VLMs' understanding, enabling researchers to test hypotheses with statistical rigor. With CIVET, we evaluate five state-of-the-art VLMs on exhaustive sets of stimuli, free from annotation noise, dataset-specific biases, and uncontrolled scene complexity. Our findings reveal that 1) current VLMs can accurately recognize only a limited set of basic object properties; 2) their performance heavily depends on the position of the object in the scene; 3) they struggle to understand basic relations among objects. Furthermore, a comparative evaluation with human annotators reveals that VLMs still fall short of achieving human-level accuracy.
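The idea of evaluating on exhaustive, controlled stimuli and then checking whether accuracy depends on object position can be illustrated with a minimal sketch. This is not CIVET's implementation: the property values, the 3x3 grid, and the names `Stimulus`, `enumerate_stimuli`, `accuracy_by_position`, and `toy_model` are all hypothetical, chosen only to show the shape of such an evaluation. A real setup would render each stimulus as an image and query a VLM in place of `toy_model`.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product

# Hypothetical property values and grid size; the source does not list them.
SHAPES = ("circle", "square", "triangle")
COLORS = ("red", "green", "blue")
GRID_SIZE = 3  # a 3x3 grid of candidate object positions


@dataclass(frozen=True)
class Stimulus:
    """One controlled scene: a single object with known properties and position."""
    shape: str
    color: str
    row: int
    col: int


def enumerate_stimuli(shapes=SHAPES, colors=COLORS, grid_size=GRID_SIZE):
    """Return every combination of grid cell, shape, and color exactly once."""
    cells = product(range(grid_size), repeat=2)
    return [Stimulus(s, c, r, k) for (r, k), s, c in product(cells, shapes, colors)]


def accuracy_by_position(stimuli, predict, target="color"):
    """Compute accuracy on one property, grouped by the cell the object occupies."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for stim in stimuli:
        cell = (stim.row, stim.col)
        totals[cell] += 1
        if predict(stim) == getattr(stim, target):
            hits[cell] += 1
    return {cell: hits[cell] / totals[cell] for cell in totals}


def toy_model(stim):
    # Stand-in for a real VLM call: answers correctly only for a centered object.
    return stim.color if (stim.row, stim.col) == (1, 1) else "unknown"


stimuli = enumerate_stimuli()
scores = accuracy_by_position(stimuli, toy_model)
```

Because the stimulus set is the full Cartesian product of the chosen factors, every position is paired with every shape and color equally often, so a difference in per-cell accuracy can be attributed to position rather than to an imbalance in the data.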