HueManity: Probing Fine-Grained Visual Perception in MLLMs
This work addresses a critical gap in the visual capabilities of MLLMs, but its contribution is incremental: it benchmarks existing models without proposing a new method.
The paper tackles the limited performance of multimodal large language models (MLLMs) on fine-grained visual perception tasks by introducing the HueManity benchmark. The benchmark reveals a large gap: the best-performing MLLM achieves only 33.6% accuracy on the numeric 'easy' task and 3% on the alphanumeric 'hard' task, whereas human participants and a fine-tuned ResNet50 baseline score between 94.5% and 100%.
Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara-test-style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer vision baselines. The best-performing MLLM achieved 33.6% accuracy on the numeric 'easy' task and a striking 3% on the alphanumeric 'hard' task. In contrast, human participants achieved near-perfect scores (100% and 95.6%), and a fine-tuned ResNet50 model reached accuracies of 96.5% and 94.5%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap in MLLMs. We open-source the HueManity dataset and code to foster further research on improving the perceptual robustness of MLLMs.
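To make the reported accuracies concrete, the following is a minimal sketch of how a two-character recognition task of this kind could be scored. It assumes exact, case-sensitive string matching after trimming surrounding whitespace; the abstract does not state the scoring protocol, so this rule, the function name `exact_match_accuracy`, and the sample strings are all illustrative assumptions rather than the benchmark's actual evaluation code.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that equal their target string exactly.

    Assumption: matching is case-sensitive and ignores only leading and
    trailing whitespace in the prediction. The real benchmark may differ.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(p.strip() == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Hypothetical ground-truth strings and model outputs, for illustration only.
targets = ["47", "A3", "9k", "Zz"]
predictions = ["47", "A8", "9k ", "zz"]

# Under the assumed rule, "A8" and "zz" count as errors.
accuracy = exact_match_accuracy(predictions, targets)
print(f"accuracy = {accuracy:.1%}")
```

Under a rule like this, a response that gets only one of the two characters right earns no credit, which is one plausible reason scores on the alphanumeric task can fall so low.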