CV | AI | LG | May 31, 2025

HueManity: Probing Fine-Grained Visual Perception in MLLMs

arXiv:2506.03194v4 | 11 citations | h-index: 7 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses a critical gap in the visual capabilities of MLLMs. Its contribution is incremental: it benchmarks existing models without proposing a new method.

The paper tackles the limited performance of multimodal large language models (MLLMs) on fine-grained visual perception tasks by introducing the HueManity benchmark. The benchmark reveals a large gap: the best-performing MLLM reaches only 33.6% accuracy on the numeric 'easy' task and 3% on the alphanumeric 'hard' task, while human participants (100% and 95.6%) and a fine-tuned ResNet50 model (96.5% and 94.5%) score far higher.

Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara-test-style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer vision baselines. The best-performing MLLM achieved 33.6% accuracy on the numeric 'easy' task and a striking 3% on the alphanumeric 'hard' task. In contrast, human participants achieved near-perfect scores (100% and 95.6%), and a fine-tuned ResNet50 model reached accuracies of 96.5% and 94.5%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap in MLLMs. We open-source the HueManity dataset and code to foster further research in improving the perceptual robustness of MLLMs.
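
To make the reported accuracies concrete, the following is a minimal sketch of how a score on two-character targets could be computed. It assumes exact-match, case-sensitive scoring, which the abstract does not specify. The function name `exact_match_accuracy` and the sample strings are hypothetical and do not come from the HueManity code release.

```python
def exact_match_accuracy(predictions, targets):
    """Return the fraction of predictions that equal their target exactly.

    Assumption: a prediction counts as correct only if both characters
    match, including case. The paper's actual protocol may differ.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("need at least one example")
    # Strip surrounding whitespace from model output; keep case as-is.
    correct = sum(p.strip() == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Hypothetical two-character targets and model outputs (illustrative only).
targets = ["47", "A3", "b9", "Zk"]
predictions = ["47", "A8", "b9", "zk"]

accuracy = exact_match_accuracy(predictions, targets)
print(f"{accuracy:.1%}")
```

Under this strict criterion, a single wrong or wrongly cased character makes the whole answer incorrect, so partial recognition earns no credit.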
