Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models

arXiv:2605.2640989.6

AI Analysis

The paper addresses the practical challenge of efficiently evaluating and mitigating jailbreak vulnerabilities across many generative models, reducing the need for per-configuration testing.

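The paper formalizes a behavioral geometry framework for predicting jailbreak susceptibility and transferring defenses across a population of models. Simple methods built on the framework reach 0.94 AUPRC for susceptibility detection with roughly 98% fewer probes than a full evaluation, and using the geometry to choose which model to transfer a defense from outperforms same-provider assignment by 2% (p = 0.03).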

Evaluating and mitigating a generative system's susceptibility to jailbreak attacks is critical to its safe deployment. Given the number of deployable systems, full per-configuration evaluation and optimization is impractical. In this paper, we formalize the behavioral geometry of a population of models that, by leveraging previously evaluated and defended models, supports both efficient susceptibility prediction and effective defense transfer across a population. We apply the framework to 79 models spanning 24 providers and to 100 system configurations of a single base model. Simple methods that use the behavioral geometry reach an AUPRC of $0.94$ for susceptibility detection with $\approx98\%$ fewer probes relative to a full evaluation. Using the behavioral geometry to select which model to transfer an optimized defense from outperforms same-provider assignment ($+2\%$, $p = 0.03$) at no additional probe cost, with a set of three models sufficient to cover the population. Results are robust to hyperparameter selection and judge.
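The abstract does not spell out how the behavioral geometry is computed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each model is summarized by a vector of response scores on a small shared probe subset, that Euclidean distance stands in for the geometry, and that a k-nearest-neighbor average is an acceptable "simple method". All function names, model names, and numbers are hypothetical.

```python
import math

def behavioral_distance(a, b):
    """Euclidean distance between two models' response profiles
    on the same (assumed) probe subset."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_references(profile, references):
    """Sort already-evaluated models from closest to farthest."""
    return sorted(
        references,
        key=lambda name: behavioral_distance(profile, references[name]["profile"]),
    )

def predict_susceptibility(profile, references, k=2):
    """Average the known susceptibility of the k nearest reference models."""
    nearest = rank_references(profile, references)[:k]
    return sum(references[n]["susceptibility"] for n in nearest) / len(nearest)

def select_defense_donor(profile, references):
    """Pick the closest reference model as the source of a transferred defense."""
    return rank_references(profile, references)[0]

# Hypothetical toy data: three previously evaluated and defended models.
references = {
    "ref_a": {"profile": [0.9, 0.8, 0.1], "susceptibility": 0.80},
    "ref_b": {"profile": [0.1, 0.2, 0.9], "susceptibility": 0.10},
    "ref_c": {"profile": [0.8, 0.9, 0.2], "susceptibility": 0.70},
}

# A new model, probed only on the small shared subset.
new_model = [0.88, 0.82, 0.12]

donor = select_defense_donor(new_model, references)
score = predict_susceptibility(new_model, references, k=2)
print(donor, round(score, 2))
```

In this sketch the new model needs only the few probes that define its profile. Its susceptibility estimate and its defense donor both come from neighbors that were already fully evaluated, which is the kind of saving the abstract describes.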

