CV · AI · Sep 25, 2025

Learning to Look: Cognitive Attention Alignment with Vision-Language Models

arXiv:2509.21247v1 · 2 citations · h-index: 6
Originality: Highly original
AI Analysis

The paper addresses unreliable decision-making in vision models, offering researchers and practitioners a scalable alternative to annotation-heavy methods.

The paper tackles the problem of CNNs exploiting superficial correlations. It proposes a framework that uses vision-language models to generate semantic attention maps automatically from natural language prompts, then aligns CNN attention with those maps through an auxiliary loss. The method achieves state-of-the-art performance on ColoredMNIST and competitive results on DecoyMNIST, with improved generalization and reduced shortcut reliance.

Convolutional Neural Networks (CNNs) frequently "cheat" by exploiting superficial correlations, raising concerns about whether they make predictions for the right reasons. Inspired by cognitive science, which highlights the role of attention in robust human perception, recent methods have sought to guide model attention using concept-based supervision and explanation regularization. However, these techniques depend on labor-intensive, expert-provided annotations, limiting their scalability. We propose a scalable framework that leverages vision-language models to automatically generate semantic attention maps using natural language prompts. By introducing an auxiliary loss that aligns CNN attention with these language-guided maps, our approach promotes more reliable and cognitively plausible decision-making without manual annotation. Experiments on challenging datasets, ColoredMNIST and DecoyMNIST, show that our method achieves state-of-the-art performance on ColoredMNIST and remains competitive with annotation-heavy baselines on DecoyMNIST, demonstrating improved generalization, reduced shortcut reliance, and model attention that better reflects human intuition.
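
The abstract does not spell out the form of the auxiliary loss, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the total objective is a classification cross-entropy plus a weighted mean squared error between the normalized CNN attention map and the normalized language-guided map. The function names, the weight `lam`, and the choice of MSE are all hypothetical, introduced here for illustration.

```python
import numpy as np


def normalize_map(m, eps=1e-8):
    """Clip a 2D map to non-negative values and scale it to sum to ~1."""
    m = np.clip(np.asarray(m, dtype=np.float64), 0.0, None)
    return m / (m.sum() + eps)


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example (numerically stable)."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[label])


def alignment_loss(cnn_attention, target_map):
    """Assumed alignment term: MSE between the two normalized maps."""
    diff = normalize_map(cnn_attention) - normalize_map(target_map)
    return float(np.mean(diff ** 2))


def total_loss(logits, label, cnn_attention, target_map, lam=1.0):
    """Classification loss plus lam times the (assumed) alignment term."""
    return cross_entropy(logits, label) + lam * alignment_loss(
        cnn_attention, target_map
    )


# Toy 4x4 example: the language-guided map marks the digit region (center).
target = np.zeros((4, 4))
target[1:3, 1:3] = 1.0

# One CNN attends to the digit region, another to a corner decoy patch.
attn_on_shape = 0.5 * target
attn_on_decoy = np.zeros((4, 4))
attn_on_decoy[0, 0] = 1.0

loss_shape = alignment_loss(attn_on_shape, target)
loss_decoy = alignment_loss(attn_on_decoy, target)
```

In this toy setup, attention concentrated on the corner patch incurs a larger alignment penalty than attention on the digit region, which is the pressure away from shortcut features that the abstract describes. In a real training loop the attention map would come from the CNN (for example via a saliency or class-activation method) and the loss would need to be differentiable in a framework such as PyTorch or JAX; those details are outside what the abstract states.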

