CV | CL | Jan 13, 2025

LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models

arXiv:2501.06986v1 | 7 citations | h-index: 22 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a research gap in hybrid MLLMs for enhanced visual understanding, with potential applications in domains like autonomous driving, though it appears incremental as it builds on existing mixture-of-experts approaches.

The paper tackles the problem of effectively integrating diverse vision encoders in multimodal large language models (MLLMs) by proposing LEO, a novel MLLM with a dual-branch vision encoder framework, which outperforms state-of-the-art open-source and hybrid MLLMs on 13 vision-language benchmarks.

Enhanced visual understanding serves as a cornerstone for multimodal large language models (MLLMs). Recent hybrid MLLMs incorporate a mixture of vision experts to address the limitations of using a single vision encoder and excessively long visual tokens. Despite the progress of these MLLMs, a research gap remains in effectively integrating diverse vision encoders. This work explores fusion strategies of visual tokens for hybrid MLLMs, leading to the design of LEO, a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling: for each segmented tile of the input images, LEO sequentially interleaves the visual tokens from its two vision encoders. Extensive evaluation across 13 vision-language benchmarks reveals that LEO outperforms state-of-the-art open-source MLLMs and hybrid MLLMs on the majority of tasks. Furthermore, we show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe, achieving competitive performance compared to existing baselines. The code and model will be publicly available.
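The per-tile interleaving described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes one possible reading of "sequentially interleaves": tokens from the two encoders alternate one by one within each tile, and tiles are then concatenated in order. It also assumes that both encoders yield the same number of tokens per tile after their own projection (the "post-adaptation" step). The function names `interleave_tile_tokens` and `fuse_tiles` are hypothetical, and strings stand in for embedding vectors.

```python
from itertools import chain


def interleave_tile_tokens(tokens_a, tokens_b):
    """Alternate tokens from two encoders for one tile: a0, b0, a1, b1, ...

    Assumption (not stated in the abstract): both encoders produce the same
    number of tokens per tile after their respective projectors.
    """
    if len(tokens_a) != len(tokens_b):
        raise ValueError("this sketch assumes equal token counts per tile")
    return list(chain.from_iterable(zip(tokens_a, tokens_b)))


def fuse_tiles(tiles_a, tiles_b):
    """Interleave tokens tile by tile, then concatenate the tiles in order."""
    if len(tiles_a) != len(tiles_b):
        raise ValueError("both encoders must see the same number of tiles")
    fused = []
    for tile_a, tile_b in zip(tiles_a, tiles_b):
        fused.extend(interleave_tile_tokens(tile_a, tile_b))
    return fused


# Toy input: two tiles, two tokens per tile from each encoder.
# Strings are placeholders for already-projected token embeddings.
tiles_a = [["a0_0", "a0_1"], ["a1_0", "a1_1"]]
tiles_b = [["b0_0", "b0_1"], ["b1_0", "b1_1"]]
sequence = fuse_tiles(tiles_a, tiles_b)
print(sequence)
```

Under this reading, the fused sequence keeps tokens from the same tile adjacent, so the language model sees both encoders' views of a region together rather than one encoder's full output followed by the other's.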

Code Implementations: 1 repo