CV | Aug 13, 2025

Do Vision Transformers See Like Humans? Evaluating their Perceptual Alignment

arXiv:2508.09850v1 | 1 citation | h-index: 25
Originality: Synthesis-oriented
AI Analysis

This work addresses perceptual misalignment in ViTs, which matters for applications requiring human-like visual understanding. Its contribution is incremental, as it builds on prior findings about model complexity.

The study investigated how Vision Transformers (ViTs) align with human perception on the TID2013 dataset, finding that larger models, repeated image exposure, and stronger data augmentation/regularization reduce perceptual alignment, with minimal impact from increased dataset diversity.

Vision Transformers (ViTs) achieve remarkable performance in image recognition tasks, yet their alignment with human perception remains largely unexplored. This study systematically analyzes how model size, dataset size, data augmentation and regularization impact ViT perceptual alignment with human judgments on the TID2013 dataset. Our findings confirm that larger models exhibit lower perceptual alignment, consistent with previous works. Increasing dataset diversity has a minimal impact, but exposing models to the same images more times reduces alignment. Stronger data augmentation and regularization further decrease alignment, especially in models exposed to repeated training cycles. These results highlight a trade-off between model complexity, training strategies, and alignment with human perception, raising important considerations for applications requiring human-like visual understanding.
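The abstract does not spell out how perceptual alignment is scored, so the following is only a minimal sketch of one common approach, not the authors' method. It assumes alignment is measured as the Spearman rank correlation between model feature distances (reference image vs. distorted image) and the human mean opinion scores (MOS) that TID2013 provides. The function names, the choice of Euclidean distance, and the toy feature vectors are all illustrative assumptions.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def perceptual_alignment(ref_feats, dist_feats, mos):
    """Correlate model distances with human scores (hypothetical helper).

    ref_feats / dist_feats: model features for each reference/distorted pair.
    mos: human mean opinion score for each pair.
    """
    distances = [euclidean(r, d) for r, d in zip(ref_feats, dist_feats)]
    return spearman(distances, mos)


# Toy example with made-up features: larger distance, lower human score.
ref = [[0.0, 0.0]] * 4
dist = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
mos = [8.0, 6.0, 5.0, 2.0]
score = perceptual_alignment(ref, dist, mos)
```

Because a higher MOS in TID2013 indicates better perceived quality, a well-aligned model yields a strongly negative correlation under this sketch. Reporting the magnitude of the correlation is a common convention, though the paper may differ.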

