CL · Mar 16, 2025

Basic Category Usage in Vision Language Models

arXiv:2503.12530v2 · h-index: 4 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work helps researchers in AI and cognitive science understand cognitive behaviors in AI models, though it is incremental in that it applies known psychological concepts to new models.

The study investigated whether vision-language models (VLMs) exhibit basic-level categorization preferences similar to those of humans. It found that Llama 3.2 Vision Instruct (11B) and Molmo 7B-D both show such preferences, including nuanced effects such as the biological versus non-biological distinction and the expert basic-level shift. Expert prompting methods yielded lower accuracy than non-expert methods.

The field of psychology has long recognized a basic level of categorization that humans use when labeling visual stimuli, a term coined by Rosch in 1976. This level of categorization has been found to be used most frequently, to have higher information density, and to aid in visual language tasks with priming in humans. Here, we investigate basic-level categorization in two recently released, open-source vision-language models (VLMs). This paper demonstrates that Llama 3.2 Vision Instruct (11B) and Molmo 7B-D both prefer basic-level categorization consistent with human behavior. Moreover, the models' preferences are consistent with nuanced human behaviors like the biological versus non-biological basic-level effects and the well-established expert basic-level shift, further suggesting that VLMs acquire complex cognitive categorization behaviors from the human data on which they are trained. We also find that our expert prompting methods demonstrate lower accuracy than our non-expert prompting methods, contradicting popular thought regarding the use of expertise prompting methods.

