CV CL Jun 3

Would you still call this Dax? Novel Visual References in VLMs and Humans

arXiv:2606.05409
AI Analysis

For researchers studying visual concept learning in VLMs, this work provides a benchmark and reveals limitations in in-context learning of novel concepts.

The paper introduces the Novel Visual References Dataset (NVRD) with 19,176 images across 90 novel visual concepts to study how VLMs map novel visual references to language, finding that models struggle to acquire concepts contradicting prior knowledge and overgeneralize compared to humans.

Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.
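The human-model comparison described above could be quantified along the following lines. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names, the idea of averaging "accept" judgments per perturbation level, and the toy numbers are all hypothetical and are not results from the paper.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def overgeneralization_gap(human_rates, model_rates):
    """Mean of (model - human) acceptance rates across perturbation levels.

    A positive value means the model keeps applying the learned label
    to stimuli that humans tend to reject.
    """
    return mean(m - h for h, m in zip(human_rates, model_rates))


# Hypothetical acceptance rates (fraction of "yes, still a Dax" answers)
# at increasing perturbation levels. Made-up toy values for illustration.
human = [1.0, 0.8, 0.5, 0.2, 0.0]
model = [1.0, 0.9, 0.7, 0.5, 0.3]

r = pearson(human, model)                   # sensitivity to perturbation
gap = overgeneralization_gap(human, model)  # how much the model over-accepts
print(f"correlation={r:.3f}, gap={gap:.3f}")
```

In this toy setup a high correlation together with a positive gap would correspond to the pattern the abstract reports: similar sensitivity to perturbations, but labels extended further by the model than by humans.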

