PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs
This work addresses a specific problem in MLLMs, the loss of text reasoning ability after multimodal fine-tuning, and offers an incremental improvement for researchers and practitioners in multimodal AI.
The paper tackles the degradation of linguistic reasoning that multimodal instruction fine-tuning causes in multimodal large language models (MLLMs). It proposes a training-free, plateau-guided model merging method that selectively injects base language model parameters to improve visual grounding, and demonstrates its effectiveness on five MLLMs across nine benchmarks.
Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this text reasoning capability, undermining multimodal performance. To mitigate this degradation, we propose a training-free framework. Through layer-wise vision token masking, we reveal a common three-stage pattern in MLLMs: early-modal separation, mid-modal alignment, and late-modal degradation. Based on an analysis of MLLM behavior at each stage, we propose a plateau-guided model merging method that selectively injects base language model parameters into MLLMs. Experimental results with five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions. Our repository is available at https://github.com/wzj1718/PlaM.
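The idea of selectively injecting base language model parameters can be illustrated with a minimal sketch. The abstract does not state the merging rule or how the layer range is chosen, so everything below is an assumption for illustration: the function name `merge_plateau_layers`, the `layers.<index>.` parameter naming, the half-open layer range `[start, end)`, and the linear interpolation with weight `alpha` are all hypothetical, and plain Python lists stand in for weight tensors.

```python
import re

# Assumed naming convention: transformer block parameters contain "layers.<index>.".
LAYER_RE = re.compile(r"layers\.(\d+)\.")


def merge_plateau_layers(mllm_state, base_state, start, end, alpha=0.5):
    """Blend base-LM weights into MLLM weights for layers in [start, end).

    Parameters outside the range, or absent from the base model
    (for example a vision projector), are copied from the MLLM unchanged.
    Linear interpolation is an illustrative choice, not the paper's stated rule.
    """
    merged = {}
    for name, weights in mllm_state.items():
        match = LAYER_RE.search(name)
        in_range = match is not None and start <= int(match.group(1)) < end
        if in_range and name in base_state:
            base = base_state[name]
            merged[name] = [(1 - alpha) * w + alpha * b for w, b in zip(weights, base)]
        else:
            merged[name] = list(weights)
    return merged


# Toy state dicts: two language layers plus one vision-only parameter.
mllm = {"layers.0.w": [1.0, 1.0], "layers.1.w": [1.0, 1.0], "vision.proj": [3.0]}
base = {"layers.0.w": [0.0, 0.0], "layers.1.w": [0.0, 2.0]}

# Merge only layer 1; layer 0 and the vision projector keep their MLLM weights.
merged = merge_plateau_layers(mllm, base, start=1, end=2, alpha=0.5)
```

In this toy setup only `layers.1.w` changes. In practice the range would presumably be derived from the stage boundaries found by the layer-wise vision token masking analysis, and the same loop could be applied to real model state dictionaries.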