CV | AI | Feb 16, 2025

Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability

Zhiyu Zhu, Zhibo Jin, Jiayu Zhang, Nan Yang, Jiahao Huang, Jianlong Zhou, Fang Chen

arXiv:2502.14889v1 | 5 citations | h-index: 25 | Has Code | ICLR

Originality: Highly original

AI Analysis

The paper addresses the need for safe deployment of multimodal models in real-world applications such as healthcare by offering a novel framework for interpretability.

The paper aims to improve the interpretability of multimodal image-text models such as CLIP by proposing the Narrowing Information Bottleneck Theory. Compared to state-of-the-art methods, the approach improves image interpretability by an average of 9% and text interpretability by an average of 58.83%, and it accelerates processing speed by 63.95%.

The task of identifying multimodal image-text representations has garnered increasing attention, particularly with models such as CLIP (Contrastive Language-Image Pretraining), which demonstrate exceptional performance in learning complex associations between images and text. Despite these advancements, ensuring the interpretability of such models is paramount for their safe deployment in real-world applications, such as healthcare. While numerous interpretability methods have been developed for unimodal tasks, these approaches often fail to transfer effectively to multimodal contexts due to inherent differences in the representation structures. Bottleneck methods, well-established in information theory, have been applied to enhance CLIP's interpretability. However, they are often hindered by strong assumptions or intrinsic randomness. To overcome these challenges, we propose the Narrowing Information Bottleneck Theory, a novel framework that fundamentally redefines the traditional bottleneck approach. This theory is specifically designed to satisfy contemporary attribution axioms, providing a more robust and reliable solution for improving the interpretability of multimodal models. In our experiments, compared to state-of-the-art methods, our approach enhances image interpretability by an average of 9%, text interpretability by an average of 58.83%, and accelerates processing speed by 63.95%. Our code is publicly accessible at https://github.com/LMBTough/NIB.
