Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
For researchers and practitioners concerned with the adversarial robustness of vision-language models, this work demonstrates that transferable multimodal jailbreaks are feasible under an untargeted threat model, challenging prior conclusions that such jailbreaks do not transfer across models.
The authors challenge the assumption that gradient-based universal image jailbreaks on vision-language models lack cross-model transferability, proposing UJEM-KL, a lightweight attack that maximizes entropy at high-entropy decision tokens to flip refusal outcomes. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses.
Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model that does not enforce a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and that non-refusal tokens already carry substantial probability mass among the top-ranked candidates before the attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability reported previously stems primarily from overly constrained optimization objectives.
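To make the shape of such an objective concrete, the following is a minimal sketch of an entropy-plus-KL loss computed on per-position logits. It is not the authors' implementation: the abstract does not give the exact formulation, so the entropy threshold `tau`, the weight `lam`, the choice to select decision tokens from the clean (unattacked) distribution, and the KL direction are all assumptions made for illustration. In an actual attack the loss would be minimized with respect to an image perturbation by backpropagating through the VLM; here only the objective itself is shown.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_entropy(probs, eps=1e-12):
    # Shannon entropy (in nats) of each position's next-token distribution.
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def entropy_kl_loss(clean_logits, adv_logits, tau=1.0, lam=1.0, eps=1e-12):
    """Hypothetical objective: lower is better for the attacker.

    clean_logits, adv_logits: arrays of shape (positions, vocab).
    tau: assumed entropy threshold that marks "decision" tokens.
    lam: assumed weight of the stabilizing KL term.
    """
    p_clean = softmax(clean_logits)
    p_adv = softmax(adv_logits)

    # Assumption: decision tokens are positions that are already
    # high-entropy before the attack.
    decision = token_entropy(p_clean) > tau

    # Term 1: maximize entropy at decision tokens (so negate it).
    h_adv = token_entropy(p_adv)
    ent_term = h_adv[decision].mean() if decision.any() else 0.0

    # Term 2: keep the remaining low-entropy positions close to the
    # clean distribution via KL(clean || adv).
    kl = (p_clean * (np.log(p_clean + eps) - np.log(p_adv + eps))).sum(axis=-1)
    kl_term = kl[~decision].mean() if (~decision).any() else 0.0

    return -ent_term + lam * kl_term, decision

# Toy example: 3 positions over a 4-token vocabulary.
clean = np.array([[0.0, 0.0, 0.0, 0.0],    # uniform: high entropy
                  [10.0, 0.0, 0.0, 0.0],   # peaked: low entropy
                  [0.0, 10.0, 0.0, 0.0]])  # peaked: low entropy
loss, mask = entropy_kl_loss(clean, clean)
print(mask, loss)
```

In this toy input only the first position exceeds the threshold, so it alone receives the entropy term, while the two peaked positions contribute only through the KL term. Separating the two terms by position is what lets the objective stay untargeted: nothing in it names a specific prefix or response to produce.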