CV, CL, LG · Jun 13, 2025

Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs

Xiao Xu, Libo Qin, Wanxiang Che, Min-Yen Kan

arXiv:2506.11515v13.61 citations · h-index: 15 · Has Code · IEEE Transactions on Circuits and Systems for Video Technology (Print)

Originality: Incremental advance

AI Analysis

This work addresses performance bottlenecks in vision-language models for researchers and practitioners, though it appears incremental as it builds on existing architectures like BridgeTower and LLaVA.

The paper tackles the limitations of BridgeTower in effectively utilizing unimodal representations and handling high-resolution images by proposing Manager, a lightweight plugin that aggregates insights from different levels of pre-trained unimodal experts. ManagerTower outperforms previous baselines on 4 downstream VL tasks, and LLaVA-OV-Manager boosts zero-shot performance across 20 datasets in MLLMs.

Two-Tower Vision-Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. While BridgeTower further enhances performance by building bridges between encoders, it (i) suffers from ineffective layer-by-layer utilization of unimodal representations, (ii) restricts the flexible exploitation of different levels of unimodal semantic knowledge, and (iii) has been evaluated only on traditional low-resolution datasets under the Two-Tower VLM architecture. In this work, we propose Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts to facilitate more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, we introduce ManagerTower, a novel VLM that places a manager in each cross-modal layer. With or without VL pre-training, ManagerTower outperforms previous strong baselines and achieves superior performance on 4 downstream VL tasks. Moreover, we extend our exploration to the latest Multimodal Large Language Model (MLLM) architecture. We demonstrate that LLaVA-OV-Manager significantly boosts the zero-shot performance of LLaVA-OV across different categories of capabilities, images, and resolutions on 20 downstream datasets, whether or not the multi-grid algorithm is enabled. In-depth analysis reveals that both our manager and the multi-grid algorithm can be viewed as plugins that improve the visual representation by capturing more diverse visual details from two orthogonal perspectives (depth and width). Their synergy can mitigate the semantic ambiguity caused by the multi-grid algorithm and further improve performance. Code and models are available at https://github.com/LooperXX/ManagerTower.
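
The abstract describes the manager only at a high level, as a plugin that adaptively aggregates representations from different levels of a pre-trained unimodal encoder. A minimal sketch of that general idea follows, assuming a softmax-weighted sum over layer outputs with input-dependent weights. It is not the authors' implementation: the function name `aggregate_layers`, the parameter `score_vector`, and the mean-pooling scoring rule are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate_layers(layer_states, score_vector):
    """Fuse per-layer representations with input-dependent weights.

    layer_states: (num_layers, seq_len, hidden) outputs of one unimodal encoder.
    score_vector: (hidden,) hypothetical learned vector that scores each layer.
    Returns the fused (seq_len, hidden) representation and the layer weights.
    """
    # Assumption: score each layer by projecting its mean-pooled states.
    pooled = layer_states.mean(axis=1)          # (num_layers, hidden)
    logits = pooled @ score_vector              # (num_layers,)
    weights = softmax(logits)                   # sums to 1 over layers
    # Weighted sum over the layer axis.
    fused = np.tensordot(weights, layer_states, axes=(0, 0))
    return fused, weights


rng = np.random.default_rng(0)
layer_states = rng.normal(size=(6, 4, 8))       # 6 layers, 4 tokens, 8 dims
score_vector = np.zeros(8)                      # zero scores give uniform weights
fused, weights = aggregate_layers(layer_states, score_vector)
```

With a zero `score_vector` the weights are uniform, so the fused output reduces to a plain average over layers; a trained scoring vector would instead emphasize whichever layers suit the input.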
