CR, AI, CL · Feb 20, 2025

How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation

arXiv:2502.14486v1 · 1 citation · h-index: 20
Originality: Incremental advance
AI Analysis

This paper addresses the vulnerability of large vision-language models to harmful prompts and offers incremental improvements to defense strategies for AI safety.

This paper tackles jailbreak attacks on generative models by reframing generation as a binary classification task that measures a model's tendency to refuse. It identifies two key defense mechanisms, safety shift and harmfulness discrimination, and builds ensemble strategies on them that either improve safety or optimize the safety-helpfulness trade-off. Experiments on the MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models support these claims.

Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability. While many defense methods have been proposed, the trade-offs between safety and helpfulness, and their application to Large Vision-Language Models (LVLMs), are not well understood. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem to assess model refusal tendencies for both harmful and benign queries. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model's ability to distinguish between harmful and benign inputs. Using these mechanisms, we develop two ensemble defense strategies (inter-mechanism ensembles and intra-mechanism ensembles) to balance safety and helpfulness. Experiments on the MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these strategies effectively improve model safety or optimize the trade-off between safety and helpfulness.
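
To make the two mechanisms concrete, here is a minimal sketch of how they could be told apart once each query has been reduced to a refusal probability. It assumes a refusal probability per query is already available for the undefended and the defended model. The function names and the two metrics (change in mean refusal probability, and change in the harmful-minus-benign gap) are illustrative assumptions, not necessarily the measures used in the paper.

```python
from statistics import mean


def safety_shift(base_harmful, base_benign, def_harmful, def_benign):
    """Change in mean refusal probability over ALL queries (harmful and benign).

    Hypothetical metric: a positive value means the defense makes the model
    refuse more often overall, regardless of query type.
    """
    base_all = list(base_harmful) + list(base_benign)
    def_all = list(def_harmful) + list(def_benign)
    return mean(def_all) - mean(base_all)


def separation(harmful, benign):
    """Gap between mean refusal probability on harmful and on benign queries."""
    return mean(harmful) - mean(benign)


def discrimination_gain(base_harmful, base_benign, def_harmful, def_benign):
    """Change in the harmful/benign gap caused by the defense.

    Hypothetical metric: a positive value means the defense helps the model
    tell harmful queries apart from benign ones.
    """
    return separation(def_harmful, def_benign) - separation(base_harmful, base_benign)


# Toy refusal probabilities (made up for illustration, not from the paper).
base_h, base_b = [0.4, 0.6], [0.1, 0.3]

# Defense A raises refusal on every query by the same amount: pure safety shift.
a_h, a_b = [0.6, 0.8], [0.3, 0.5]

# Defense B raises refusal on harmful queries only: mostly discrimination.
b_h, b_b = [0.7, 0.9], [0.1, 0.3]

for name, (h, b) in {"A": (a_h, a_b), "B": (b_h, b_b)}.items():
    shift = safety_shift(base_h, base_b, h, b)
    gain = discrimination_gain(base_h, base_b, h, b)
    print(f"defense {name}: shift={shift:.2f}, discrimination gain={gain:.2f}")
```

In this toy setup, defense A moves every query toward refusal without widening the harmful/benign gap, while defense B widens the gap and leaves benign queries untouched. That contrast is the intuition behind combining defenses across or within mechanisms.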

