CVCRMar 14, 2025

Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization

arXiv:2503.11750v14 citationsh-index: 44EMNLP
Originality Highly original
AI Analysis

This work addresses safety vulnerabilities in large vision-language models for security researchers, offering a more efficient jailbreaking method that reduces computational costs.

The paper tackles the problem of inefficient adversarial optimization steps in jailbreaking large vision-language models by introducing HKVE, a framework that selectively accepts gradient updates based on attention score distributions, achieving attack success rates of 75.08% on MiniGPT4, 85.84% on LLaVA, and 81.00% on Qwen-VL, with improvements of 20.43%, 21.01%, and 26.43% over existing methods.

In the realm of large vision-language models (LVLMs), adversarial jailbreak attacks serve as a red-teaming approach to identify safety vulnerabilities of these models and their associated defense mechanisms. However, we identify a critical limitation: not every adversarial optimization step leads to a positive outcome, and indiscriminately accepting optimization results at each step may reduce the overall attack success rate. To address this challenge, we introduce HKVE (Hierarchical Key-Value Equalization), an innovative jailbreaking framework that selectively accepts gradient optimization results based on the distribution of attention scores across different layers, ensuring that every optimization step positively contributes to the attack. Extensive experiments demonstrate HKVE's significant effectiveness, achieving attack success rates of 75.08% on MiniGPT4, 85.84% on LLaVA and 81.00% on Qwen-VL, substantially outperforming existing methods by margins of 20.43\%, 21.01\% and 26.43\% respectively. Furthermore, making every step effective not only leads to an increase in attack success rate but also allows for a reduction in the number of iterations, thereby lowering computational costs. Warning: This paper contains potentially harmful example data.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes