Can audio-visual integration strengthen robustness under multimodal attacks?
This work addresses the problem of securing multimodal AI systems against adversarial attacks. The contribution is incremental, building on existing adversarial robustness research in audio-visual learning.
The paper systematically studies the robustness of audio-visual models under multimodal adversarial attacks, finds that audio-visual integration can decrease robustness rather than strengthen it, and proposes a defense method that improves resistance to attacks without significantly sacrificing clean performance.
In this paper, we present a systematic study of machines' multisensory perception under attacks. We use the audio-visual event recognition task against multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning. We attack the audio modality, the visual modality, and both modalities together to explore whether audio-visual integration still strengthens perception and how different fusion mechanisms affect the robustness of audio-visual models. To interpret the multimodal interactions under attacks, we learn a weakly-supervised sound source visual localization model to localize sounding regions in videos. To mitigate multimodal attacks, we propose an audio-visual defense approach based on an audio-visual dissimilarity constraint and external feature memory banks. Extensive experiments demonstrate that audio-visual models are susceptible to multimodal adversarial attacks; that audio-visual integration could decrease model robustness rather than strengthen it under multimodal attacks; that even a weakly-supervised sound source visual localization model can be successfully fooled; and that our defense method can improve the invulnerability of audio-visual networks without significantly sacrificing clean model performance.
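To make the three attack settings (audio only, visual only, both modalities) concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a toy binary classifier with linear late fusion and a single-step FGSM perturbation; the abstract does not name the attack algorithm, the fusion mechanism, or any of the function and parameter names used here (`fused_logit`, `fgsm_attack`, `eps`, `target`), so all of them are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def sign(x):
    # Returns -1, 0, or 1.
    return (x > 0) - (x < 0)


def fused_logit(audio, visual, w_audio, w_visual, bias):
    """Hypothetical late fusion: a linear score over audio and visual features."""
    return dot(w_audio, audio) + dot(w_visual, visual) + bias


def fgsm_attack(audio, visual, label, w_audio, w_visual, bias, eps, target="both"):
    """One FGSM step on the chosen modality ('audio', 'visual', or 'both').

    Assumes binary cross-entropy loss, whose gradient with respect to the
    logit is (p - label); the chain rule gives (p - label) * w per feature.
    """
    p = sigmoid(fused_logit(audio, visual, w_audio, w_visual, bias))
    g = p - label
    adv_audio, adv_visual = list(audio), list(visual)
    if target in ("audio", "both"):
        # Move each audio feature by eps in the loss-increasing direction.
        adv_audio = [a + eps * sign(g * w) for a, w in zip(audio, w_audio)]
    if target in ("visual", "both"):
        adv_visual = [v + eps * sign(g * w) for v, w in zip(visual, w_visual)]
    return adv_audio, adv_visual


if __name__ == "__main__":
    # Made-up weights and inputs, for illustration only.
    w_audio, w_visual, bias = [1.0, -0.5], [0.8, 0.3], 0.0
    audio, visual, label = [0.6, -0.2], [0.5, 0.4], 1
    for target in ("audio", "visual", "both"):
        adv_a, adv_v = fgsm_attack(audio, visual, label, w_audio, w_visual,
                                   bias, eps=0.3, target=target)
        print(target, sigmoid(fused_logit(adv_a, adv_v, w_audio, w_visual, bias)))
```

In this linear toy model, perturbing both modalities always lowers the true-class score at least as much as perturbing either one alone, because the fused score exposes both inputs to the attacker. That is only an intuition for why integration can enlarge the attack surface; the paper's findings concern deep audio-visual networks and are established experimentally.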