SpatialViz-Bench: An MLLM Benchmark for Spatial Visualization
SpatialViz-Bench addresses a significant gap in how spatial visualization is evaluated in MLLMs, though the contribution is incremental because it builds on existing benchmark efforts.
The authors address the insufficient evaluation of spatial visualization in multi-modal Large Language Models (MLLMs) by introducing SpatialViz-Bench, a comprehensive benchmark of 1,180 automatically generated problems across 12 tasks. They find that state-of-the-art MLLMs show wide performance variations and clear deficiencies, including difficulty perception misaligned with human intuition and, in open-source models, performance degradation under Chain-of-Thought prompting.
Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark's strong discriminative power, but also uncovers counter-intuitive findings: models show difficulty perception misaligned with human intuition, exhibit dramatic 2D-to-3D performance cliffs, default to formulaic derivation over visualization, and paradoxically suffer performance degradation from Chain-of-Thought prompting in open-source models. Through statistical and qualitative analysis of error types, SpatialViz-Bench demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark data and evaluation code are publicly available.