CV · Jun 15, 2025

Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models

Junbo Niu, Yuanhong Zheng, Ziyang Miao, Hejun Dong, Chunjiang Ge, Hao Liang, Ma Lu, Bohan Zeng, Qiahao Zheng, Conghui He, Wentao Zhang

arXiv:2506.12776v116.46 citations · h-index: 8 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a critical bottleneck in VLM design for real-world applications by providing systematic tools to handle varied visual inputs, though the advance is incremental because it builds on existing encoding strategies.

The paper tackles the challenge of diverse image resolutions and aspect ratios in Vision-Language Models (VLMs) with two contributions: RC-Bench, a benchmark for evaluating VLMs under extreme visual conditions, and NativeRes-LLaVA, an open-source training framework for native-resolution processing. The authors report significant performance improvements on resolution-centric benchmarks.

Vision-Language Models (VLMs) face significant challenges when dealing with the diverse resolutions and aspect ratios of real-world images, as most existing models rely on fixed, low-resolution inputs. While recent studies have explored integrating native resolution visual encoding to improve model performance, such efforts remain fragmented and lack a systematic framework within the open-source community. Moreover, existing benchmarks fall short in evaluating VLMs under varied visual conditions, often neglecting resolution as a critical factor. To address the "Resolution Dilemma" stemming from both model design and benchmark limitations, we introduce RC-Bench, a novel benchmark specifically designed to systematically evaluate VLM capabilities under extreme visual conditions, with an emphasis on resolution and aspect ratio variations. In conjunction, we propose NativeRes-LLaVA, an open-source training framework that empowers VLMs to effectively process images at their native resolutions and aspect ratios. Based on RC-Bench and NativeRes-LLaVA, we conduct comprehensive experiments on existing visual encoding strategies. The results show that Native Resolution Visual Encoding significantly improves the performance of VLMs on RC-Bench as well as other resolution-centric benchmarks. Code is available at https://github.com/Niujunbo2002/NativeRes-LLaVA.
