CV, AI | Dec 31, 2025

UR-Bench: A Benchmark for Multi-Hop Reasoning over Ultra-High-Resolution Images

Siqi Li, Xinyu Cai, Jianbiao Mei, Nianchen Deng, Pinlong Cai, Licheng Wen, Yufan Shen, Xuemeng Yang, Botian Shi, Yong Liu

arXiv:2601.08748v1 | 2.81 citations | h-index: 17

Originality: Synthesis-oriented

AI Analysis

This work addresses the limited visual complexity of existing VQA benchmarks, a problem relevant to researchers and developers working on multimodal AI. The contribution is incremental, since it builds on existing MLLM capabilities.

The authors address the lack of benchmarks for evaluating multimodal large language models (MLLMs) on ultra-high-resolution images by introducing UR-Bench, a benchmark that pairs images of up to gigapixel scale with multi-hop reasoning questions. They also propose an agent-based framework and report that it improves performance.

Recent multimodal large language models (MLLMs) show strong capabilities in visual-language reasoning, yet their performance on ultra-high-resolution imagery remains largely unexplored. Existing visual question answering (VQA) benchmarks typically rely on medium-resolution data, offering limited visual complexity. To bridge this gap, we introduce the Ultra-high-resolution Reasoning Benchmark (UR-Bench), a benchmark designed to evaluate the reasoning capabilities of MLLMs under extreme visual information. UR-Bench comprises two major categories, Humanistic Scenes and Natural Scenes, covering four subsets of ultra-high-resolution images with distinct spatial structures and data sources. Each subset contains images ranging from hundreds of megapixels to gigapixels, accompanied by questions organized into three levels, enabling evaluation of models' reasoning capabilities in ultra-high-resolution scenarios. We further propose an agent-based framework in which a language model performs reasoning by invoking external visual tools. In addition, we introduce Semantic Abstraction and Retrieval tools that enable more efficient processing of ultra-high-resolution images. We evaluate state-of-the-art models using both end-to-end MLLMs and our agent-based framework, and the results demonstrate the effectiveness of our framework.
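
The abstract does not specify the interface between the language model and its visual tools, so the following is only a minimal sketch of what a tool-invoking agent loop for a two-hop question could look like. Everything in it is a hypothetical stand-in rather than the authors' implementation: the image is replaced by a small dictionary of tile descriptions, `retrieve` and `describe` loosely play the roles of retrieval and abstraction tools, and `scripted_policy` takes the place of the language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Tile = Tuple[int, int]  # (row, column) of a tile in the large image

# Hypothetical stand-in for an ultra-high-resolution image: a sparse map from
# tile coordinates to a short text description of the tile's content.
IMAGE_INDEX: Dict[Tile, str] = {
    (0, 0): "harbor with a red boat",
    (0, 1): "warehouse with a blue door",
    (5, 7): "clock tower",
}


def retrieve(query: str) -> List[Tile]:
    """Illustrative retrieval tool: tiles whose description mentions the query."""
    q = query.lower()
    return sorted(t for t, desc in IMAGE_INDEX.items() if q in desc.lower())


def describe(tile: Tile) -> str:
    """Illustrative abstraction tool: short description of a single tile."""
    return IMAGE_INDEX.get(tile, "empty")


@dataclass
class Action:
    tool: str    # "retrieve", "describe", or "answer"
    arg: object  # tool argument, or the final answer when tool == "answer"


History = List[Tuple[Action, object]]  # (action taken, observation returned)


def scripted_policy(history: History) -> Action:
    """Stand-in for the language model, answering the two-hop question
    'What is in the tile to the right of the red boat?'."""
    if not history:
        return Action("retrieve", "red boat")      # hop 1: locate the boat
    if len(history) == 1:
        row, col = history[0][1][0]                # first tile found in hop 1
        return Action("describe", (row, col + 1))  # hop 2: look to its right
    return Action("answer", history[-1][1])        # report the last observation


def run_agent(policy: Callable[[History], Action], max_steps: int = 5) -> Optional[object]:
    """Call tools chosen by the policy until it answers or the step budget ends."""
    tools = {"retrieve": retrieve, "describe": describe}
    history: History = []
    for _ in range(max_steps):
        action = policy(history)
        if action.tool == "answer":
            return action.arg
        history.append((action, tools[action.tool](action.arg)))
    return None


if __name__ == "__main__":
    print(run_agent(scripted_policy))
```

In this toy setup the policy never sees the full image. It only receives the small observations returned by each tool call, which is the general idea behind delegating perception to external tools when an image is too large to process directly.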
