CV · Jan 13

KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old?

Xianfeng Wang, Kaiwei Zhang, Qi Jia, Zijian Chen, Guangtao Zhai, Xiongkuo Min

arXiv:2601.08292v1 · 1.5 · h-index: 49

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of evaluating fundamental visual perception in MLLMs. It shows that these models lack human-like perceptual primitives despite their high-level reasoning abilities. The contribution is incremental: it provides a new benchmark but does not propose a novel method to close the gap.

The paper introduces KidVis, a benchmark grounded in human visual development that assesses six atomic visual capabilities in Multimodal Large Language Models (MLLMs). Human children score 95.32 on average, whereas the top-performing model, GPT-5, reaches only 67.33. The paper also reports a 'Scaling Law Paradox': increasing parameter count does not yield linear improvement in these foundational skills.

While Multimodal Large Language Models (MLLMs) have demonstrated impressive proficiency in high-level reasoning tasks, such as complex diagrammatic interpretation, it remains an open question whether they possess the fundamental visual primitives comparable to human intuition. To investigate this, we introduce KidVis, a novel benchmark grounded in the theory of human visual development. KidVis deconstructs visual intelligence into six atomic capabilities - Concentration, Tracking, Discrimination, Memory, Spatial, and Closure - already possessed by 6-7 year old children, comprising 10 categories of low-semantic-dependent visual tasks. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity. Results indicate that while human children achieve a near-perfect average score of 95.32, the state-of-the-art GPT-5 attains only 67.33. Crucially, we observe a "Scaling Law Paradox": simply increasing model parameters fails to yield linear improvements in these foundational visual capabilities. This study confirms that current MLLMs, despite their reasoning prowess, lack the essential physiological perceptual primitives required for generalized visual intelligence.
