CR AI CYJan 14, 2025

Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models

Abdulkadir Erol, Trilok Padhi, Agnik Saha, Ugur Kursuncu, Mehmet Emin Aktas

arXiv:2501.09039v110.43 citationsh-index: 16Has Code

Originality Incremental advance

AI Analysis

It addresses the problem of toxic content generation in LVLMs for developers and users, but is incremental as it builds on existing red-teaming efforts.

This study systematically examined vulnerabilities in open-source Large Vision-Language Models (LVLMs) like LLaVA and Qwen, finding that adversarial prompts can induce toxic responses with mean rates of 16.13% for toxicity and 9.75% for insulting, highlighting significant safety risks.

The rapid advancement of Large Vision-Language Models (LVLMs) has enhanced capabilities offering potential applications from content creation to productivity enhancement. Despite their innovative potential, LVLMs exhibit vulnerabilities, especially in generating potentially toxic or unsafe responses. Malicious actors can exploit these vulnerabilities to propagate toxic content in an automated (or semi-) manner, leveraging the susceptibility of LVLMs to deception via strategically crafted prompts without fine-tuning or compute-intensive procedures. Despite the red-teaming efforts and inherent potential risks associated with the LVLMs, exploring vulnerabilities of LVLMs remains nascent and yet to be fully addressed in a systematic manner. This study systematically examines the vulnerabilities of open-source LVLMs, including LLaVA, InstructBLIP, Fuyu, and Qwen, using adversarial prompt strategies that simulate real-world social manipulation tactics informed by social theories. Our findings show that (i) toxicity and insulting are the most prevalent behaviors, with the mean rates of 16.13% and 9.75%, respectively; (ii) Qwen-VL-Chat, LLaVA-v1.6-Vicuna-7b, and InstructBLIP-Vicuna-7b are the most vulnerable models, exhibiting toxic response rates of 21.50%, 18.30% and 17.90%, and insulting responses of 13.40%, 11.70% and 10.10%, respectively; (iii) prompting strategies incorporating dark humor and multimodal toxic prompt completion significantly elevated these vulnerabilities. Despite being fine-tuned for safety, these models still generate content with varying degrees of toxicity when prompted with adversarial inputs, highlighting the urgent need for enhanced safety mechanisms and robust guardrails in LVLM development.

View on arXiv PDF

Similar