LG | Mar 2, 2025

Re-Imagining Multimodal Instruction Tuning: A Representation View

Yiyang Liu, James Chenhao Liang, Ruixiang Tang, Yugyung Lee, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Lifu Huang, Dongfang Liu, Qifan Wang, Cheng Han

arXiv:2503.00723v3 | 16 citations | h-index: 17 | ICLR

Originality: Highly original

AI Analysis

This paper addresses efficiency and interpretability challenges in multimodal instruction tuning, offering AI researchers and practitioners a novel method with substantial gains.

The paper tackles the parameter-intensive nature of fine-tuning large multimodal models by introducing Multimodal Representation Tuning (MRT), which directly edits multimodal representations. MRT achieves strong performance with far fewer tunable parameters, reaching a 1580.40 MME score while tuning only 0.03% of the parameters.

Multimodal instruction tuning has proven to be an effective strategy for achieving zero-shot generalization by fine-tuning pre-trained Large Multimodal Models (LMMs) with instruction-following data. However, as the scale of LMMs continues to grow, fully fine-tuning these models has become highly parameter-intensive. Although Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced to reduce the number of tunable parameters, a significant performance gap remains compared to full fine-tuning. Furthermore, existing PEFT approaches are often highly parameterized, making them difficult to interpret and control. In light of this, we introduce Multimodal Representation Tuning (MRT), a novel approach that focuses on directly editing semantically rich multimodal representations to achieve strong performance and provide intuitive control over LMMs. Empirical results show that our method surpasses current state-of-the-art baselines with significant performance gains (e.g., 1580.40 MME score) while requiring substantially fewer tunable parameters (e.g., 0.03% parameters). Additionally, we conduct experiments on editing instrumental tokens within multimodal representations, demonstrating that direct manipulation of these representations enables simple yet effective control over network behavior.
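The abstract does not spell out the editing operation, so the following is only a minimal sketch of what "editing a representation while the backbone stays frozen" could look like. It assumes a generic low-rank additive edit, h' = h + up(down(h) + bias); this is an illustrative assumption, not necessarily the formulation used in the paper. The class name `LowRankEdit`, the helper `matvec`, and the sizes `dim` and `rank` are all hypothetical.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LowRankEdit:
    """Hypothetical low-rank edit: h' = h + up @ (down @ h + bias).

    Illustrative assumption only; not taken from the paper.
    """

    def __init__(self, dim, rank, seed=0):
        rng = random.Random(seed)
        # down projects the representation into a small rank-r subspace.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(rank)]
        self.bias = [0.0] * rank
        # up starts at zero, so the edit is initially the identity map.
        self.up = [[0.0] * rank for _ in range(dim)]

    def __call__(self, h):
        z = [p + b for p, b in zip(matvec(self.down, h), self.bias)]
        delta = matvec(self.up, z)
        return [x + d for x, d in zip(h, delta)]

    def num_params(self):
        rank, dim = len(self.down), len(self.down[0])
        return 2 * dim * rank + rank


dim, rank = 64, 2            # toy sizes chosen for illustration
edit = LowRankEdit(dim, rank)
h = [1.0] * dim              # stand-in for one frozen hidden representation
frozen_params = dim * dim    # one frozen dim x dim layer, for comparison

print(edit(h) == h)          # zero-initialised edit leaves h unchanged
print(edit.num_params(), frozen_params)
```

In this toy setting only `down`, `bias`, and `up` would be trained, which is why the tunable parameter count stays small relative to the frozen layer. The ratio here is not meant to reproduce the 0.03% figure reported in the abstract.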
