CV | May 19, 2025

Computer Vision Models Show Human-Like Sensitivity to Geometric and Topological Concepts

arXiv:2505.13281v16.21 citations | h-index: 3 | CogSci

Originality: Incremental advance

AI Analysis

This research examines how humans acquire abstract concepts, with implications for cognitive science and AI alignment. It is an incremental advance that builds on prior studies.

The study investigated whether computer vision models trained on large image datasets develop human-like sensitivity to geometric and topological concepts. Transformer-based models achieved the highest accuracy, surpassing young children, and showed strong alignment with children's performance.

With the rapid improvement of machine learning (ML) models, cognitive scientists are increasingly asking about their alignment with how humans think. Here, we ask this question for computer vision models and human sensitivity to geometric and topological (GT) concepts. Under the core knowledge account, these concepts are innate and supported by dedicated neural circuitry. In this work, we investigate an alternative explanation, that GT concepts are learned "for free" through everyday interaction with the environment. We do so using computer vision models, which are trained on large image datasets. We build on prior studies to investigate the overall performance and human alignment of three classes of models, namely convolutional neural networks (CNNs), transformer-based models, and vision-language models, on an odd-one-out task testing 43 GT concepts spanning seven classes. Transformer-based models achieve the highest overall accuracy, surpassing that of young children. They also show strong alignment with children's performance, finding the same classes of concepts easy vs. difficult. By contrast, vision-language models underperform their vision-only counterparts and deviate further from human profiles, indicating that naïve multimodality might compromise abstract geometric sensitivity. These findings support the use of computer vision models to evaluate the sufficiency of the learning account for explaining human sensitivity to GT concepts, while also suggesting that integrating linguistic and visual representations might have unpredicted deleterious consequences.
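
The abstract does not say how a model's choice is read out in the odd-one-out task. One common approach, assumed here purely for illustration, is to embed every image in a trial and pick the image whose embedding is least similar, on average, to the others. The function names, the use of cosine similarity, and the input shapes below are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np


def pick_odd_one_out(embeddings: np.ndarray) -> int:
    """Return the index of the item least similar, on average, to the rest.

    embeddings: array of shape (n_items, dim), one row per image in a trial.
    Assumes every row is non-zero (hypothetical model embeddings).
    """
    # Normalize rows so that dot products become cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sim = unit @ unit.T
    # Ignore each item's similarity to itself when averaging.
    np.fill_diagonal(sim, 0.0)
    mean_sim = sim.sum(axis=1) / (len(embeddings) - 1)
    return int(np.argmin(mean_sim))


def odd_one_out_accuracy(trials, targets) -> float:
    """Fraction of trials in which the picked item matches the true outlier."""
    picks = [pick_odd_one_out(t) for t in trials]
    return float(np.mean([p == t for p, t in zip(picks, targets)]))
```

Computing this accuracy separately for each concept class would give a per-class profile for a model, which could then be compared with children's per-class accuracy to judge which classes both find easy or difficult.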
