Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency
For researchers evaluating AI-human alignment in scene understanding, this work provides a method to quantify semantic gaps in closed-source VLMs.
The paper introduces Counterfactual Semantic Saliency (CSS), a black-box framework that measures object importance in VLMs through causal ablation. It finds that, relative to humans, VLMs over-rely on large, central, and salient objects and rely less on people, with size bias being a key driver of model-human divergence.
Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures, and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS), a black-box, model-agnostic framework that quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: relative to humans, models over-rely on large objects (size bias), objects at the center of the image (center bias), and high-saliency objects. In contrast, models rely less than our human participants on the people in a scene when describing the images. A model's size bias is a primary driver explaining variations in model-human semantic divergence. Code and data will be available at https://github.com/starsky77/Counterfactual-Semantic-Saliency.
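The core scoring idea can be illustrated with a minimal sketch: describe the original scene, describe each counterfactual variant with one object removed, and score that object by how far the description shifts. The abstract does not specify the semantic distance or the interface to the model, so everything below is an assumption made for illustration: the names `describe`, `css_scores`, and `cosine_distance` are hypothetical, and bag-of-words cosine distance is a simple stand-in for whatever semantic similarity measure the authors actually use.

```python
import math
from collections import Counter
from typing import Callable, Dict


def bag_of_words(text: str) -> Counter:
    """Toy text representation (assumption): lowercase token counts."""
    return Counter(text.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty description as maximally different
    return 1.0 - dot / (norm_a * norm_b)


def css_scores(
    describe: Callable[[str], str],
    original: str,
    variants: Dict[str, str],
) -> Dict[str, float]:
    """Score each object by the description shift its removal causes.

    describe: black-box captioner (a VLM query, or a human response lookup).
    original: identifier of the unmodified scene.
    variants: maps object name -> identifier of the scene with it removed.
    """
    base = bag_of_words(describe(original))
    return {
        obj: cosine_distance(base, bag_of_words(describe(image)))
        for obj, image in variants.items()
    }


# Hypothetical captions standing in for real model outputs.
captions = {
    "scene": "a man walks a dog in a park",
    "scene_no_dog": "a man walks in a park",
    "scene_no_bench": "a man walks a dog in a park",
}
scores = css_scores(
    describe=captions.__getitem__,
    original="scene",
    variants={"dog": "scene_no_dog", "bench": "scene_no_bench"},
)
```

In this toy case, removing the dog changes the caption and removing the bench does not, so the dog receives the higher score. Because `describe` is only ever called as a black box, the same function could in principle be applied to a closed-source model and to human responses, which is what makes a direct comparison of the two score profiles possible.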