LG AI | Mar 4

Feature-level Interaction Explanations in Multimodal Transformers

Yeji Kim, Housam Khalifa Bashier Babiker, Mi-Young Kim, Randy Goebel

arXiv:2603.13326 | h-index: 6

AI Analysis

This work addresses the need for better interpretability in multimodal AI systems, which matters most to researchers and practitioners using Transformers. The contribution is incremental, since it builds on existing explainable AI methods.

The paper addresses the problem of explaining how different modalities jointly support decisions in multimodal Transformers. It introduces Feature-level I2MoE, which explicitly separates unique, synergistic, and redundant evidence at the feature level. Across three benchmarks (MMIMDb, ENRICO, and MMHS150K), this yields more interaction-specific importance patterns, and removing the identified interaction pairs degrades performance more than removing random pairs.

Multimodal Transformers often produce predictions without clarifying how different modalities jointly support a decision. Most existing multimodal explainable AI (MXAI) methods extend unimodal saliency to multimodal backbones, highlighting important tokens or patches within each modality, but they rarely pinpoint which cross-modal feature pairs provide complementary evidence (synergy) or serve as reliable backups (redundancy). We present Feature-level I2MoE (FL-I2MoE), a structured Mixture-of-Experts layer that operates directly on token/patch sequences from frozen pretrained encoders and explicitly separates unique, synergistic, and redundant evidence at the feature level. We further develop an expert-wise explanation pipeline that combines attribution with top-K% masking to assess faithfulness, and we introduce Monte Carlo interaction probes to quantify pairwise behavior: the Shapley Interaction Index (SII) to score synergistic pairs and a redundancy-gap score to capture substitutable (redundant) pairs. Across three benchmarks (MMIMDb, ENRICO, and MMHS150K), FL-I2MoE yields more interaction-specific and concentrated importance patterns than a dense Transformer with the same encoders. Finally, pair-level masking shows that removing pairs ranked by SII or redundancy-gap degrades performance more than masking randomly chosen pairs under the same budget, supporting that the identified interactions are causally relevant.
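
To make the Monte Carlo interaction probe mentioned in the abstract concrete, here is a minimal Python sketch of a sampling estimator for the standard Shapley Interaction Index of a feature pair. It is not the authors' implementation: the function names (`sii_monte_carlo`, `toy_value`), the sampling budget, and the toy value function are all assumptions made for illustration. In practice `value_fn` would presumably be a model score computed with only the listed features left unmasked.

```python
import random


def sii_monte_carlo(value_fn, n_features, i, j, n_samples=2000, seed=0):
    """Estimate the Shapley Interaction Index of features i and j by sampling.

    value_fn takes a frozenset of "present" feature indices and returns a score.
    Drawing a coalition size uniformly from 0..n-2 and then a uniform subset of
    that size reproduces the standard SII weighting over coalitions.
    """
    rng = random.Random(seed)
    others = [k for k in range(n_features) if k not in (i, j)]
    total = 0.0
    for _ in range(n_samples):
        size = rng.randint(0, len(others))  # both endpoints are inclusive
        subset = frozenset(rng.sample(others, size))
        # Discrete second-order difference: the joint effect of i and j
        # beyond the sum of their individual effects on this coalition.
        total += (
            value_fn(subset | {i, j})
            - value_fn(subset | {i})
            - value_fn(subset | {j})
            + value_fn(subset)
        )
    return total / n_samples


def toy_value(present):
    """Hypothetical stand-in for a model score (an assumption, not from the paper)."""
    score = 0.0
    if 0 in present and 1 in present:  # features 0 and 1 only help together
        score += 1.0
    if 2 in present or 3 in present:   # features 2 and 3 are substitutes
        score += 1.0
    if 4 in present:                   # feature 4 contributes on its own
        score += 0.5
    return score


if __name__ == "__main__":
    for pair in [(0, 1), (2, 3), (0, 4)]:
        print(pair, sii_monte_carlo(toy_value, 5, *pair))
```

On this toy function the complementary pair (0, 1) receives a positive score, the independent pair (0, 4) scores zero, and the substitutable pair (2, 3) scores negative. The abstract describes a separate redundancy-gap score for substitutable pairs; its definition is not given here, so it is not reproduced in this sketch.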
