CV · SD · Apr 5

A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

arXiv:2604.03995 · 72.8
Predicted impact top 39% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work exposes a critical and underexplored threat to MLLMs used in safety-critical applications such as common-sense reasoning and content moderation, though the advance is incremental because it builds on prior research into unimodal attacks.

The paper examines the vulnerability of audio-visual multi-modal large language models (MLLMs) to cross-modal typographic attacks and finds that coordinated multi-modal attacks achieve an 83.43% attack success rate, compared with 34.93% for single-modality attacks.

As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that a coordinated multi-modal attack creates a significantly more potent threat than single-modality attacks (attack success rate = $83.43\%$ vs. $34.93\%$). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.

