Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

Umberto Cappellazzo, Stavros Petridis, Maja Pantic

arXiv:2603.12046v1 · 9.2 · h-index: 90

Predicted impact: top 42% in AS · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses the need for interpretability in audio-visual speech recognition models, which matters most to researchers and practitioners working with noisy environments. It is incremental in that it applies existing Shapley attribution methods to a specific domain.

The paper examines how audio-visual speech recognition models balance acoustic and visual modalities, especially under noise, and introduces Dr. SHAP-AV to analyze modality contributions using Shapley values. The results show that models shift toward visual reliance under noise but maintain high audio contributions even under severe degradation, with SNR the dominant factor driving modality weighting.

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.
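
To make the idea of a Shapley-based modality balance concrete, here is a minimal sketch that treats audio and video as two players and computes exact Shapley values from a coalition score. It is an illustration under stated assumptions, not the procedure used in the paper: the function names, the coalition scores in `toy_scores`, and the normalization by absolute value are all hypothetical, and the abstract does not specify how Dr. SHAP-AV defines its value function or aggregates attributions.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a small set of players.

    `value` maps a frozenset of players to a real-valued score, for
    example a model's confidence when only those modalities are present.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def relative_contributions(phi):
    """Normalize absolute Shapley values so they sum to 1 (an assumed convention)."""
    total = sum(abs(v) for v in phi.values())
    if total == 0:
        return {p: 0.0 for p in phi}
    return {p: abs(v) / total for p, v in phi.items()}


# Hypothetical coalition scores, chosen only for illustration.
toy_scores = {
    frozenset(): 0.0,
    frozenset({"audio"}): 0.6,
    frozenset({"video"}): 0.3,
    frozenset({"audio", "video"}): 0.8,
}

phi = shapley_values(["audio", "video"], toy_scores.__getitem__)
shares = relative_contributions(phi)
print(phi)
print(shares)
```

With these toy scores the two Shapley values sum to the gap between the full-input score and the empty-input score (the efficiency property), and audio receives the larger share. Repeating such a computation at different SNR levels is one way a shift in modality balance could be tracked, again as an assumption about how an analysis of this kind might be set up.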
