From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models
This work addresses a foundational issue for AI safety and robustness: it identifies a failure mode shared by hallucinations and jailbreaks and suggests that defenses should address both jointly. The contribution is incremental, however, in that it links two already-known vulnerabilities.
The paper addresses two vulnerabilities of large foundation models: hallucinations and jailbreak attacks. It proposes a unified theoretical framework, shows that mitigation techniques for one vulnerability can reduce the other, and validates these claims empirically on LLaVA-1.5 and MiniGPT-4.
Large foundation models (LFMs) are susceptible to two distinct vulnerabilities: hallucinations and jailbreak attacks. While typically studied in isolation, we observe that defenses targeting one often affect the other, hinting at a deeper connection. We propose a unified theoretical framework that models jailbreaks as token-level optimization and hallucinations as attention-level optimization. Within this framework, we establish two key propositions: (1) Similar Loss Convergence: the loss functions for both vulnerabilities converge similarly when optimizing for target-specific outputs; and (2) Gradient Consistency in Attention Redistribution: both exhibit consistent gradient behavior driven by shared attention dynamics. We validate these propositions empirically on LLaVA-1.5 and MiniGPT-4, showing consistent optimization trends and aligned gradients. Leveraging this connection, we demonstrate that mitigation techniques for hallucinations can reduce jailbreak success rates, and vice versa. Our findings reveal a shared failure mode in LFMs and suggest that robustness strategies should jointly address both vulnerabilities.
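To make the notion of "aligned gradients" concrete, the sketch below measures the cosine similarity between the gradients of two objectives with respect to shared attention logits. It is only an illustration under strong assumptions, not the paper's method: the model is a toy single-head softmax attention layer, both losses are hypothetical squared-error stand-ins for the two objectives, gradients are estimated by finite differences, and every name (`gradient_alignment`, `target_a`, `target_b`, and so on) is invented for this example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of attention logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def attention_output(logits, values):
    """Toy single-head attention: softmax weights applied to value vectors."""
    return softmax(logits) @ values

def numerical_grad(loss_fn, params, h=1e-5):
    """Central-difference estimate of the gradient of loss_fn at params."""
    params = np.asarray(params, dtype=float)
    grad = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step.flat[i] = h
        grad.flat[i] = (loss_fn(params + step) - loss_fn(params - step)) / (2 * h)
    return grad

def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two flattened gradient arrays."""
    a = np.ravel(np.asarray(a, dtype=float))
    b = np.ravel(np.asarray(b, dtype=float))
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def gradient_alignment(logits, values, target_a, target_b):
    """Cosine similarity between the gradients of two hypothetical losses.

    Both losses are squared errors between the attention output and a target;
    they are placeholders for two objectives, not the paper's loss functions.
    """
    loss_a = lambda z: float(np.sum((attention_output(z, values) - target_a) ** 2))
    loss_b = lambda z: float(np.sum((attention_output(z, values) - target_b) ** 2))
    g_a = numerical_grad(loss_a, logits)
    g_b = numerical_grad(loss_b, logits)
    return cosine_similarity(g_a, g_b)

# Usage: two nearby targets over the same attention logits (all data synthetic).
rng = np.random.default_rng(0)
logits = rng.normal(size=6)
values = rng.normal(size=(6, 4))
target_a = rng.normal(size=4)
target_b = target_a + 0.1 * rng.normal(size=4)
print(gradient_alignment(logits, values, target_a, target_b))
```

A value near 1 means the two objectives push the attention logits in nearly the same direction, a value near 0 means the updates are unrelated, and a negative value means they conflict. For a real model one would replace the finite-difference helper with automatic differentiation and the placeholder losses with the actual objectives under study.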