CL · Jan 26

Grounded Concreteness: Human-Like Concreteness Sensitivity in Vision-Language Models

arXiv:2601.18065v1
h-index: 2
Originality: Incremental advance
AI Analysis

This research examines how multimodal training affects language processing, a question of interest to AI researchers. It is an incremental advance, as it builds on existing comparisons between VLMs and LLMs.

The study investigated whether vision-language models (VLMs) develop more human-like sensitivity to linguistic concreteness than text-only large language models (LLMs) when evaluated with text-only prompts. It found that VLMs showed larger gains on concrete inputs, clearer concreteness-structured representations, better alignment with human norms, and different attention patterns.

Do vision-language models (VLMs) develop more human-like sensitivity to linguistic concreteness than text-only large language models (LLMs) when both are evaluated with text-only prompts? We study this question with a controlled comparison between matched Llama text backbones and their Llama Vision counterparts across multiple model scales, treating multimodal pretraining as an ablation on perceptual grounding rather than access to images at inference. We measure concreteness effects at three complementary levels: (i) output behavior, by relating question-level concreteness to QA accuracy; (ii) embedding geometry, by testing whether representations organize along a concreteness axis; and (iii) attention dynamics, by quantifying context reliance via attention-entropy measures. In addition, we elicit token-level concreteness ratings from models and evaluate alignment to human norm distributions, testing whether multimodal training yields more human-consistent judgments. Across benchmarks and scales, VLMs show larger gains on more concrete inputs, exhibit clearer concreteness-structured representations, produce ratings that better match human norms, and display systematically different attention patterns consistent with increased grounding.
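
The abstract mentions attention-entropy measures without giving a formula. As a rough illustration only, the sketch below assumes the measure is the Shannon entropy of each query position's attention distribution, averaged over positions; the paper's actual definition may differ. The function names and the toy attention rows are hypothetical and are not taken from the paper.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` holds the non-negative attention weights of a single query
    position. Assumption: plain Shannon entropy; the paper may use a variant.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    # Renormalize defensively so the weights form a probability distribution.
    probs = [w / total for w in weights]
    # Zero-weight entries are skipped: p * log(p) tends to 0 as p tends to 0.
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_attention_entropy(attention_rows):
    """Average the per-query entropy over all query positions of one head."""
    entropies = [attention_entropy(row) for row in attention_rows]
    return sum(entropies) / len(entropies)


# Hypothetical toy data: one head, one query position, four context tokens.
focused = [[0.97, 0.01, 0.01, 0.01]]  # attention concentrated on one token
diffuse = [[0.25, 0.25, 0.25, 0.25]]  # attention spread over the context

print(mean_attention_entropy(focused))
print(mean_attention_entropy(diffuse))
```

Under this assumed definition, lower entropy means attention concentrated on a few tokens, and higher entropy means attention spread more evenly across the context. Comparing such values between a text backbone and its vision counterpart is one way the "context reliance" contrast could be quantified.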

