CV · CR · LG · Jun 25, 2022

Defending Multimodal Fusion Models against Single-Source Adversaries

arXiv:2206.12714v1 · 50 citations · h-index: 27
Originality: Highly original
AI Analysis

This paper addresses a critical robustness problem for multimodal AI systems in applications such as action recognition and object detection, offering a novel defense against single-source adversaries.

The paper tackles the vulnerability of multimodal fusion models to adversarial attacks on a single modality. It shows that standard models fail even though the unperturbed modalities still carry redundant, correct information, and it proposes a robust fusion strategy that improves on state-of-the-art single-source robustness by up to 48.2% (on object detection) without degrading performance on clean data.

Beyond achieving high performance across many vision tasks, multimodal models are expected to be robust to single-source faults due to the availability of redundant information between modalities. In this paper, we investigate the robustness of multimodal neural networks against worst-case (i.e., adversarial) perturbations on a single modality. We first show that standard multimodal fusion models are vulnerable to single-source adversaries: an attack on any single modality can overcome the correct information from multiple unperturbed modalities and cause the model to fail. This surprising vulnerability holds across diverse multimodal tasks and necessitates a solution. Motivated by this finding, we propose an adversarially robust fusion strategy that trains the model to compare information coming from all the input sources, detect inconsistencies in the perturbed modality compared to the other modalities, and only allow information from the unperturbed modalities to pass through. Our approach significantly improves on state-of-the-art methods in single-source robustness, achieving gains of 7.8-25.2% on action recognition, 19.7-48.2% on object detection, and 1.6-6.7% on sentiment analysis, without degrading performance on unperturbed (i.e., clean) data.
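The idea of comparing modalities, singling out the inconsistent one, and fusing only the rest can be illustrated with a small sketch. This is not the paper's method: the abstract describes a trained fusion strategy, whereas the code below swaps in a fixed cosine-similarity rule purely for illustration. The names `find_inconsistent` and `robust_fuse`, the `margin` threshold, and the use of averaging as the fusion step are all assumptions introduced here.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_inconsistent(features: List[Vector], margin: float = 0.5) -> Optional[int]:
    """Return the index of the modality that disagrees most with the others.

    Hypothetical hand-written rule (an assumption, not the paper's learned
    detector): a modality is flagged when its mean similarity to the other
    modalities falls below theirs by more than `margin`. Returns None when
    no modality stands out.
    """
    k = len(features)
    if k < 3:
        raise ValueError("need at least three modalities to single one out")
    # Mean agreement of each modality with all the others.
    agreement = []
    for i in range(k):
        sims = [cosine(features[i], features[j]) for j in range(k) if j != i]
        agreement.append(sum(sims) / len(sims))
    worst = min(range(k), key=lambda i: agreement[i])
    others = [agreement[i] for i in range(k) if i != worst]
    if sum(others) / len(others) - agreement[worst] > margin:
        return worst
    return None


def robust_fuse(features: List[Vector], margin: float = 0.5) -> List[float]:
    """Average the feature vectors, skipping the flagged modality if any."""
    suspect = find_inconsistent(features, margin)
    kept = [f for i, f in enumerate(features) if i != suspect]
    dim = len(kept[0])
    return [sum(f[d] for f in kept) / len(kept) for d in range(dim)]


# Toy per-modality feature vectors (made-up values for illustration).
clean = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]]
attacked = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]]  # third modality perturbed

print(find_inconsistent(clean))     # no modality flagged
print(find_inconsistent(attacked))  # index of the perturbed modality
print(robust_fuse(attacked))        # fused from the two consistent modalities
```

A fixed rule like this is only a toy: an adversary who knows the threshold could craft a perturbation that stays just inside it, which is one reason a fusion strategy trained against such attacks, as the abstract describes, is the more realistic design.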
