CV · Feb 21, 2025

MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing

Matvey Skripkin, Elizaveta Goncharova, Dmitrii Tarasov, Andrey Kuznetsov

arXiv:2502.15381v1 · 1 citation · h-index: 6

Originality: Incremental advance

AI Analysis

This work aims to improve performance on domain-specific vision-language tasks and represents an incremental advance over single-encoder methods.

Most multimodal language models rely on a single pre-trained vision encoder, even though many specialized encoders exist that perform better in particular domains. The paper proposes MOVE, a mixture-of-vision-encoders approach that automatically routes each input to the most appropriate encoder. It reports competitive accuracy on benchmarks such as ChartQA, MMBench, and MMMU without complex image slicing.

Multimodal language models (MLMs) integrate visual and textual information by coupling a vision encoder with a large language model through a dedicated adapter. While existing approaches commonly rely on a single pre-trained vision encoder, a wide variety of specialized encoders exists that can boost a model's performance in distinct domains. In this work, we propose MOVE (Mixture of Vision Encoders), a simple yet effective approach to leveraging multiple pre-trained encoders for specialized multimodal tasks. MOVE automatically routes inputs to the most appropriate encoder among candidates such as Unichat, InternViT, and Texify, thereby enhancing performance across a diverse set of benchmarks, including ChartQA, MMBench, and MMMU. Experimental results demonstrate that MOVE achieves competitive accuracy without incurring the complexities of image slicing for high-resolution images.
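
The routing idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how MOVE scores or selects encoders, so everything below is an assumption made for illustration: the class name `MixtureOfEncoders`, the hard argmax selection, the toy router, and the stub encoders labeled "chart", "general", and "formula" are all hypothetical and do not call any real encoder.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Image = Sequence[float]   # stand-in for real pixel data
Features = List[float]    # stand-in for the visual features passed to the adapter


@dataclass
class MixtureOfEncoders:
    """Hypothetical wrapper: pick one encoder per input and run only that one."""
    encoders: Dict[str, Callable[[Image], Features]]
    router: Callable[[Image], Dict[str, float]]

    def route(self, image: Image) -> str:
        # Hard routing (an assumption): choose the highest-scoring encoder.
        scores = self.router(image)
        return max(self.encoders, key=lambda name: scores.get(name, float("-inf")))

    def encode(self, image: Image) -> Tuple[str, Features]:
        name = self.route(image)
        return name, self.encoders[name](image)


def toy_router(image: Image) -> Dict[str, float]:
    # Toy rule for the demo only: read the first three values as domain scores.
    return {"chart": image[0], "general": image[1], "formula": image[2]}


model = MixtureOfEncoders(
    encoders={
        # Stub encoders; a real system would wrap pre-trained vision models.
        "chart": lambda img: [x * 2.0 for x in img],
        "general": lambda img: list(img),
        "formula": lambda img: [x + 1.0 for x in img],
    },
    router=toy_router,
)

name, features = model.encode([0.1, 0.2, 0.9])
print(name, features)
```

Because only the selected encoder runs, the per-image cost in this sketch stays close to that of a single-encoder model. A learned classifier could replace `toy_router` without changing the rest of the code.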

