LG | AI | May 6, 2025

Adversarial Attacks in Multimodal Systems: A Practitioner's Survey

arXiv:2505.03084v1 | 2 citations | h-index: 3 | Has Code | COMPSAC
Originality: Synthesis-oriented
AI Analysis

This survey helps machine learning practitioners understand and mitigate vulnerabilities when deploying open-source multimodal models, though as a survey paper its contribution is incremental.

This paper addresses the lack of a practitioner-focused survey on adversarial attacks in multimodal systems by summarizing the threat landscape across text, image, video, and audio modalities, providing an overview of attack types and their evolution.

The introduction of multimodal models is a major step forward in Artificial Intelligence: a single model is trained to understand multiple modalities, namely text, image, video, and audio. Open-source multimodal models have made these breakthroughs more accessible. However, given the vast landscape of adversarial attacks across these modalities, such models inherit the vulnerabilities of every modality they handle, which amplifies the overall adversarial threat. While broad research is available on possible attacks within or across these modalities, a practitioner-focused view that outlines attack types remains absent in the multimodal world. As more machine learning practitioners adopt, fine-tune, and deploy open-source models in real-world applications, it is crucial that they can view the threat landscape and take the necessary preventive actions. This paper addresses the gap by surveying adversarial attacks targeting all four modalities: text, image, video, and audio. The survey provides a view of the adversarial attack landscape and presents how multimodal adversarial threats have evolved. To the best of our knowledge, it is the first comprehensive summary of the threat landscape in the multimodal world.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
