AI CL CV | Nov 27, 2024

Cross-modal Information Flow in Multimodal Large Language Models

Zhi Zhang, Srishti Yadav, Fengze Han, Ekaterina Shutova

arXiv:2411.18620v228.554 citations | h-index: 4 | Has Code | CVPR

Originality: Incremental advance

AI Analysis

The paper gives researchers insight into how multimodal models process information, though the advance is incremental because it builds on existing MLLM frameworks.

The study investigated how visual and linguistic information interact in multimodal large language models (MLLMs) for visual question answering, finding that integration occurs in two distinct stages: lower layers transfer general visual features to question tokens, and middle layers transfer object-specific visual information to relevant token positions.

Recent advances in auto-regressive multimodal large language models (MLLMs) have shown promising progress on vision-language tasks. While a variety of studies have investigated how large language models process linguistic information, little is currently known about the inner workings of MLLMs and how linguistic and visual information interact within them. In this study, we aim to fill this gap by examining the information flow between the two modalities, language and vision, in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with several models from the LLaVA series, we find that the two modalities are integrated in two distinct stages. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens. In the middle layers, it once again transfers visual information, this time about specific objects relevant to the question, to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction. Overall, our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in MLLMs, thereby facilitating future research into multimodal information localization and editing. Our code and collected dataset are released here: https://github.com/FightingFighting/cross-modal-information-flow-in-MLLM.git.
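
The abstract does not say how the flow between token groups is measured. One common way to probe this kind of question is to block selected attention edges and see which downstream representations change. The toy sketch below, in Python with NumPy, illustrates that idea only. It assumes a single causal attention head with random weights and a hypothetical token layout (four image positions followed by three question positions); none of these names or sizes come from the paper, and it is not the authors' code.

```python
import numpy as np

def causal_attention(q, k, v, blocked=()):
    """Single-head causal attention.

    `blocked` is an iterable of (target, source) index pairs whose
    attention edge is cut, so `target` cannot read from `source`.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: position i may only attend to positions j <= i.
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    for tgt, src in blocked:
        mask[tgt, src] = True
    scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical layout (an assumption for illustration only).
image_pos = list(range(0, 4))
question_pos = list(range(4, 7))

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 8))            # toy hidden states
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Baseline pass, then a pass where question tokens cannot read image tokens.
out_base, _ = causal_attention(q, k, v)
cut = [(t, s) for t in question_pos for s in image_pos]
out_cut, w_cut = causal_attention(q, k, v, blocked=cut)

# Per-position change caused by cutting the image -> question edges.
change = np.linalg.norm(out_base - out_cut, axis=-1)
```

In this sketch, `change` is zero at the image positions and non-zero at the question positions, since only the question rows lost inputs. In a real multi-layer model, one would presumably apply such a mask within a window of layers and compare the effect on the final prediction across windows; that procedure is an assumption here, not a description of the paper's experiments.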
