AIMay 7

Conceal, Reconstruct, Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLMs

Md Farhamdur Reza, Richeng Jin, Tianfu Wu, Huaiyu Dai

arXiv:2605.0570998.7h-index: 6Has Code

AI Analysis

For MLLM safety researchers, this work reveals a fundamental vulnerability where the model's own reconstruction ability can be exploited to bypass safety filters.

The paper identifies a reconstruction-concealment tradeoff in intent-obfuscation jailbreak attacks on MLLMs, showing that existing methods fail to balance it. The proposed character-removed variants and keyword-related distractor images achieve up to 30% higher attack success rates across multiple models.

Intent-obfuscation-based jailbreak attacks on multimodal large language models (MLLMs) transform a harmful query into a concealed multimodal input to bypass safety mechanisms. We show that such attacks are governed by a \emph{reconstruction--concealment tradeoff}: the transformed input must hide harmful intent from safety filters while remaining recoverable enough for the victim model to reconstruct the original request. Through a reconstruction analysis of three representative black-box methods, we find that existing transformations struggle to balance this tradeoff, limiting their effectiveness. In contrast, we show that character-removed variants achieve a better balance. Building on this, we propose \emph{concealment-aware variant construction}, which greedily selects character-removed variants that are low in harmful-keyword alignment and mutually diverse, and instantiates them through five modality-aware prompting strategies. We further introduce \emph{keyword-related distractor images} that depict the harmful keyword in diverse contexts, providing more effective auxiliary visual context than generic distractor images. Experiments across closed-source and open-source MLLMs show the proposed strategies outperform strong baselines, revealing an underexplored vulnerability: a model's own reconstruction ability can be exploited to recover hidden harmful intent and produce unsafe responses.

View on arXiv PDF

Similar