CL · Sep 18, 2025

Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models

Madison Van Doren, Casey Ford, Emily Dix

arXiv:2509.15478v1 · 1 citation

Originality: Synthesis-oriented

AI Analysis

The paper addresses safety risks for users and developers of multimodal AI systems, but its contribution is incremental: it benchmarks existing models without proposing new methods.

This study evaluated the harmlessness of four multimodal large language models (GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus) under adversarial prompts, finding significant differences in vulnerability, with Pixtral 12B showing the highest harmful response rate (~62%) and Claude Sonnet 3.5 the most resistant (~10%).

Multimodal large language models (MLLMs) are increasingly used in real-world applications, yet their safety under adversarial conditions remains underexplored. This study evaluates the harmlessness of four leading MLLMs (GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus) when exposed to adversarial prompts across text-only and multimodal formats. A team of 26 red teamers generated 726 prompts targeting three harm categories: illegal activity, disinformation, and unethical behaviour. These prompts were submitted to each model, and 17 annotators rated 2,904 model outputs for harmfulness using a 5-point scale. Results show significant differences in vulnerability across models and modalities. Pixtral 12B exhibited the highest rate of harmful responses (~62%), while Claude Sonnet 3.5 was the most resistant (~10%). Contrary to expectations, text-only prompts were slightly more effective at bypassing safety mechanisms than multimodal ones. Statistical analysis confirmed that both model type and input modality were significant predictors of harmfulness. These findings underscore the urgent need for robust, multimodal safety benchmarks as MLLMs are deployed more widely.
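The counts in the abstract are consistent with each other: 726 prompts submitted to each of 4 models gives 2,904 outputs. The minimal Python sketch below checks that arithmetic and shows one way a per-model harmful response rate could be computed from 5-point ratings. It is an illustration only, not the authors' analysis: the function names, the assumption that higher scores mean more harmful, the cutoff `threshold=3`, and the toy ratings are all hypothetical, since the abstract specifies none of them.

```python
from collections import defaultdict

# Figures taken from the abstract.
N_PROMPTS = 726
MODELS = ["GPT-4o", "Claude Sonnet 3.5", "Pixtral 12B", "Qwen VL Plus"]


def expected_outputs(n_prompts, n_models):
    """Each prompt is submitted once to every model."""
    return n_prompts * n_models


def harmful_rate_by_model(ratings, threshold=3):
    """Return the share of outputs per model rated at or above `threshold`.

    `ratings` is an iterable of (model_name, score) pairs with scores on a
    1-5 scale. ASSUMPTION: higher scores mean more harmful, and the cutoff
    of 3 is a placeholder; the abstract does not state either.
    """
    totals = defaultdict(int)
    harmful = defaultdict(int)
    for model, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        totals[model] += 1
        if score >= threshold:
            harmful[model] += 1
    return {model: harmful[model] / totals[model] for model in totals}


# Consistency check on the reported counts.
total_outputs = expected_outputs(N_PROMPTS, len(MODELS))

# Toy ratings, invented purely to exercise the function (not study data).
toy_ratings = [
    ("model_a", 1), ("model_a", 4),
    ("model_b", 5), ("model_b", 5),
]
toy_rates = harmful_rate_by_model(toy_ratings)
```

With the real annotation data, the same function would be called on one (model, score) pair per rated output; a different cutoff or an averaged score across annotators would change the resulting rates.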
