SD | CV | Feb 17, 2025

Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives

arXiv:2502.11858v3 | 2 citations | h-index: 12 | ICLR
Originality: Incremental advance
AI Analysis

This paper addresses security risks in multi-modal AI systems, offering incremental improvements in adversarial robustness for audio-visual learning.

The paper tackles adversarial vulnerabilities in audio-visual models. It proposes two attacks, one exploiting temporal redundancy and one exploiting modality misalignment, along with an adversarial training framework that improves both adversarial robustness and training efficiency on the Kinetics-Sounds dataset.

While audio-visual learning equips models with a richer understanding of the real world by leveraging multiple sensory modalities, this integration also introduces new vulnerabilities to adversarial attacks. In this paper, we present a comprehensive study of the adversarial robustness of audio-visual models, considering both temporal and modality-specific vulnerabilities. We propose two powerful adversarial attacks: 1) a temporal invariance attack that exploits the inherent temporal redundancy across consecutive time segments and 2) a modality misalignment attack that introduces incongruence between the audio and visual modalities. These attacks are designed to thoroughly assess the robustness of audio-visual models against diverse threats. Furthermore, to defend against such attacks, we introduce a novel audio-visual adversarial training framework. This framework addresses key challenges in vanilla adversarial training by incorporating efficient adversarial perturbation crafting tailored to multi-modal data and an adversarial curriculum strategy. Extensive experiments on the Kinetics-Sounds dataset demonstrate that our proposed temporal and modality-based attacks achieve state-of-the-art performance in degrading model accuracy, while our adversarial training defense substantially improves both adversarial robustness and adversarial training efficiency.
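For readers unfamiliar with the baseline that such attacks extend, the following is a minimal sketch of a joint audio-visual PGD attack, not the paper's method. It assumes a toy late-fusion linear classifier over flattened features, and every name in it (`joint_pgd`, `W_a`, `W_v`, `eps_a`, `eps_v`) is hypothetical. The temporal invariance and modality misalignment objectives are not reproduced, since the abstract does not specify their form.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, label):
    return -np.log(softmax(logits)[label] + 1e-12)

def joint_pgd(audio, video, label, W_a, W_v, eps_a=0.05, eps_v=0.05, steps=10):
    """L_inf PGD on both modalities of a toy late-fusion model.

    Assumed toy model (not from the paper):
        logits = W_a @ audio + W_v @ video
    """
    step_a, step_v = eps_a / 4, eps_v / 4      # per-iteration step sizes
    delta_a = np.zeros_like(audio)
    delta_v = np.zeros_like(video)
    onehot = np.eye(W_a.shape[0])[label]
    for _ in range(steps):
        logits = W_a @ (audio + delta_a) + W_v @ (video + delta_v)
        grad_logits = softmax(logits) - onehot  # d(cross-entropy)/d(logits)
        grad_a = W_a.T @ grad_logits            # gradient w.r.t. audio input
        grad_v = W_v.T @ grad_logits            # gradient w.r.t. video input
        # Signed ascent step, then projection back onto each L_inf ball.
        delta_a = np.clip(delta_a + step_a * np.sign(grad_a), -eps_a, eps_a)
        delta_v = np.clip(delta_v + step_v * np.sign(grad_v), -eps_v, eps_v)
    return audio + delta_a, video + delta_v
```

Keeping a separate budget per modality reflects the fact that audio and visual inputs live on different scales. Adversarial training in its vanilla form would call a routine like this inside every training step, which is the cost the abstract's efficiency claim is aimed at.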

