Analyzing Finetuning Representation Shift for Multimodal LLMs Steering
This work addresses interpretability and control of MLLMs, a concern for researchers and practitioners who work with such models. Its contribution is incremental, since it builds on existing concept-level analysis methods.
The authors study how the behavior of multimodal LLMs (MLLMs) changes during fine-tuning by mapping hidden states to interpretable concepts, which reveals concept alterations and potential biases. They show that shift vectors can recover fine-tuned concepts through simple additive shifts applied to the original model, with applications to model debiasing and safety enforcement.
Multimodal LLMs (MLLMs) have reached remarkable levels of proficiency in understanding multimodal inputs. However, understanding and interpreting the behavior of such complex models is a challenging task, not to mention the dynamic shifts that may occur during fine-tuning, or due to covariate shift between datasets. In this work, we apply concept-level analysis to MLLM understanding. More specifically, we propose to map hidden states to interpretable visual and textual concepts. This enables us to more efficiently compare certain semantic dynamics, such as the shift between an original and a fine-tuned model, revealing concept alteration and potential biases that may occur during fine-tuning. We also demonstrate the use of shift vectors to capture these concept changes. These shift vectors allow us to recover fine-tuned concepts by applying simple, computationally inexpensive additive concept shifts in the original model. Finally, our findings also have direct applications for MLLM steering, which can be used for model debiasing as well as enforcing safety in MLLM output. All in all, we propose a novel, training-free, ready-to-use framework for MLLM behavior interpretability and control. Our implementation is publicly available.
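
To make the idea of an additive shift concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that a shift vector can be estimated as the mean difference between paired hidden states of the fine-tuned and original models on the same inputs, and it uses synthetic NumPy arrays in place of real MLLM activations. The function names `compute_shift_vector` and `apply_shift` and the scaling parameter `alpha` are hypothetical, introduced only for illustration.

```python
import numpy as np


def compute_shift_vector(h_original, h_finetuned):
    """Estimate a shift vector as the mean difference of paired hidden states.

    Assumption: both arrays have shape (n_samples, hidden_dim) and row i of
    each array comes from the same input sample.
    """
    h_original = np.asarray(h_original, dtype=float)
    h_finetuned = np.asarray(h_finetuned, dtype=float)
    if h_original.shape != h_finetuned.shape:
        raise ValueError("paired hidden states must have the same shape")
    return (h_finetuned - h_original).mean(axis=0)


def apply_shift(h, shift, alpha=1.0):
    """Additively steer hidden states; alpha scales the shift strength."""
    return np.asarray(h, dtype=float) + alpha * shift


# Synthetic stand-in for real activations (assumption: 200 samples, 16 dims).
rng = np.random.default_rng(0)
h_orig = rng.normal(size=(200, 16))
true_shift = rng.normal(size=16)
# Pretend fine-tuning moved every representation by true_shift plus small noise.
h_ft = h_orig + true_shift + 0.01 * rng.normal(size=(200, 16))

shift = compute_shift_vector(h_orig, h_ft)
h_steered = apply_shift(h_orig, shift)
```

In this toy setting the steered representations have the same mean as the fine-tuned ones, which is the sense in which a single inexpensive vector addition can stand in for fine-tuning. With real models the shift would more plausibly be computed per concept or per layer, and how well it recovers fine-tuned behavior is an empirical question.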