CV · AI · CL · Aug 19, 2025

RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

AI2
arXiv:2508.13968v2 · 5 citations · h-index: 25
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in the spatial reasoning of MLLMs. The contribution is incremental because it focuses on a single, narrowly defined benchmark task.

The paper evaluates Multimodal Large Language Models (MLLMs) on identifying image rotation (0°, 90°, 180°, 270°). It finds that state-of-the-art models such as GPT-5 and Gemini-2.5-Pro do not reliably distinguish 90° from 270° rotations, and that auxiliary information or chain-of-thought prompting yields only small, inconsistent improvements.

We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
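The evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the helper names `rotate_ccw` and `per_angle_accuracy` are hypothetical, the image is stood in for by a small 2D grid, the counter-clockwise convention is an assumption, and the prediction records are made-up toy data rather than results from the paper. The sketch only shows how the four rotated inputs could be produced and how accuracy could be broken down by true rotation angle, which is the breakdown the abstract reports (0° and 180° versus 90° and 270°).

```python
from collections import Counter

ANGLES = (0, 90, 180, 270)  # the four rotations named in the abstract


def rotate_ccw(grid, angle):
    """Rotate a 2D grid (list of rows) counter-clockwise by a multiple of 90 degrees.

    Assumption: counter-clockwise is used here for illustration only; the
    source text does not state which direction the benchmark uses.
    """
    if angle not in ANGLES:
        raise ValueError(f"angle must be one of {ANGLES}, got {angle}")
    out = [list(row) for row in grid]
    for _ in range(angle // 90):
        # One 90-degree counter-clockwise turn: transpose, then reverse row order.
        out = [list(col) for col in zip(*out)][::-1]
    return out


def per_angle_accuracy(records):
    """Compute accuracy per true angle from (true_angle, predicted_angle) pairs."""
    correct, total = Counter(), Counter()
    for true_angle, predicted in records:
        total[true_angle] += 1
        correct[true_angle] += int(predicted == true_angle)
    return {a: correct[a] / total[a] for a in ANGLES if total[a]}


# Toy stand-in for an image and for model predictions (hypothetical data).
image = [[1, 2],
         [3, 4]]
rotated_inputs = {a: rotate_ccw(image, a) for a in ANGLES}

toy_records = [(0, 0), (0, 0), (90, 270), (90, 90), (180, 180), (270, 90)]
accuracy = per_angle_accuracy(toy_records)
```

Reporting accuracy per true angle, rather than as one aggregate number, is what makes a 90° versus 270° confusion visible: a model can score well overall while failing on exactly those two classes.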

