CV | May 11, 2025

Visual Instruction Tuning with Chain of Region-of-Interest

Yixin Chen, Shuai Zhang, Boran Han, Bernie Wang

Amazon

arXiv:2505.06840v16.22 citations | h-index: 25

Originality: Highly original

AI Analysis

This work addresses the high computational cost of processing high-resolution images in multimodal AI and offers researchers and practitioners a more efficient method, though it is incremental in that it builds on existing visual instruction tuning approaches.

The paper tackles the computational burden of high-resolution images in multimodal large language models by proposing Chain of Region-of-Interest (CoRoI), which identifies and prioritizes the most informative image regions to improve visual comprehension without processing lengthy high-resolution image tokens. Evaluated on 11 benchmarks, the method outperforms LLaVA-NeXT on almost all of them, and the finetuned 34B model surpasses Gemini Pro 1.0 on six benchmarks and GPT-4V on MMB, SEED-I, and MME.

High-resolution (HR) images are pivotal for enhancing the recognition and understanding capabilities of multimodal large language models (MLLMs). However, directly increasing image resolution can significantly escalate computational demands. In this study, we propose a method called Chain of Region-of-Interest (CoRoI) for Visual Instruction Tuning, aimed at alleviating the computational burden associated with high-resolution images for MLLMs. Drawing inspiration from the selective nature of the human visual system, we recognize that not all regions within high-resolution images carry equal importance. CoRoI seeks to identify and prioritize the most informative regions, thereby enhancing multimodal visual comprehension and recognition while circumventing the need for processing lengthy HR image tokens. Through extensive experiments on 11 benchmarks, we validate the efficacy of CoRoI across varying sizes, ranging from 7B to 34B in parameters. Our models consistently demonstrate superior performance across diverse multimodal benchmarks and tasks. Notably, our method outperforms LLaVA-NeXT on almost all benchmarks and our finetuned 34B model surpasses proprietary methods like Gemini Pro 1.0 on six benchmarks, as well as outperforming GPT-4V on MMB, SEED-I, and MME.
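
The abstract does not describe how CoRoI selects regions, so the following is only a rough sketch of the token arithmetic behind its motivation, not the authors' method. It assumes a ViT-style encoder that splits an image into non-overlapping square patches. The patch size of 14, the 336-pixel base resolution, the 1344-pixel high-resolution input, and the choice of four regions are all illustrative assumptions, and the function names are hypothetical.

```python
def num_patch_tokens(height: int, width: int, patch: int = 14) -> int:
    """Count the non-overlapping patch tokens a ViT-style encoder would emit.

    Assumption: one token per patch, no class token, no pooling.
    """
    return (height // patch) * (width // patch)


def roi_token_budget(num_regions: int, region_side: int = 336, patch: int = 14) -> int:
    """Tokens needed when only `num_regions` square crops are encoded.

    Assumption: each crop is resized to `region_side` pixels per side.
    """
    return num_regions * num_patch_tokens(region_side, region_side, patch)


# Encoding the whole image at low resolution.
low_res_tokens = num_patch_tokens(336, 336)

# Encoding the whole image at 4x the side length: token count grows 16x.
high_res_tokens = num_patch_tokens(1344, 1344)

# Hypothetical alternative: one low-resolution overview plus four region crops.
overview_plus_rois = low_res_tokens + roi_token_budget(4)

print(low_res_tokens, high_res_tokens, overview_plus_rois)
```

Under these assumptions the token count grows with the square of the image side length, while a fixed number of region crops keeps the total bounded regardless of the original resolution. Actual savings depend on how regions are chosen and encoded, which the abstract leaves unspecified.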

