CV · AI · May 11

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

Mengqi He, Xinyu Tian, Xin Shen, Shu Zou, Jinhong Ni, Zhaoyuan Yang, Weikang Li, Xuesong Li, Jing Zhang

arXiv:2605.10764 | 72.2

AI Analysis

For researchers and practitioners concerned with adversarial robustness of vision-language models, this work demonstrates that transferable multimodal jailbreaks are feasible under an untargeted threat model, challenging prior conclusions.

The authors challenge the assumption that gradient-based universal image jailbreaks on vision-language models lack cross-model transferability, proposing UJEM-KL, a lightweight attack that maximizes entropy at high-entropy decision tokens to flip refusal outcomes. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under defenses.

Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model that does not enforce a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and that non-refusal tokens already carry substantial probability mass among the top-ranked candidates before any attack is applied. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.
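
The abstract does not give the exact objective, so the following is only an illustrative sketch of the two quantities it names: per-token entropy, used to separate high-entropy "decision" positions from low-entropy ones, and a KL term that measures drift at the remaining positions. The function names, the fixed entropy threshold, the direction of the KL divergence, and the `kl_weight` parameter are all assumptions made for illustration, not details taken from the paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q); q is floored at eps to avoid division by zero.
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def split_positions(per_token_logits, threshold):
    # Assumption: a fixed entropy threshold separates "decision" positions
    # (high entropy) from the rest. The paper may use a different rule.
    high, low = [], []
    for t, logits in enumerate(per_token_logits):
        h = entropy(softmax(logits))
        (high if h >= threshold else low).append(t)
    return high, low

def sketch_objective(clean_logits, perturbed_logits, threshold, kl_weight=1.0):
    # Hypothetical scalar: mean entropy at high-entropy positions minus a
    # weighted mean KL(clean || perturbed) at the low-entropy positions.
    high, low = split_positions(clean_logits, threshold)
    ent = sum(entropy(softmax(perturbed_logits[t])) for t in high) / max(len(high), 1)
    kl = sum(kl_divergence(softmax(clean_logits[t]), softmax(perturbed_logits[t]))
             for t in low) / max(len(low), 1)
    return ent - kl_weight * kl

# Toy example: position 0 is near-deterministic, position 1 is uniform.
toy_logits = [[10.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
high_positions, low_positions = split_positions(toy_logits, threshold=0.5)
```

In this toy input, the uniform position has entropy ln 3 and lands in the high-entropy group, while the peaked position lands in the low-entropy group. Positions are selected from the clean logits here so that the split stays fixed as the perturbed logits change, which is one possible design choice among several.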
