CRCLAug 16, 2025

Mitigating Jailbreaks with Intent-Aware LLMs

arXiv:2508.12072v21 citationsh-index: 13Has Code
Originality Incremental advance
AI Analysis

This addresses a critical safety issue for users of LLMs by mitigating adversarial attacks, though it is an incremental improvement over existing defenses.

The paper tackles the problem of jailbreak attacks on large language models by proposing Intent-FT, a fine-tuning approach that trains models to infer instruction intent, resulting in no attack exceeding a 50% success rate and improved robustness while preserving general capabilities.

Despite extensive safety-tuning, large language models (LLMs) remain vulnerable to jailbreak attacks via adversarially crafted instructions, reflecting a persistent trade-off between safety and task performance. In this work, we propose Intent-FT, a simple and lightweight fine-tuning approach that explicitly trains LLMs to infer the underlying intent of an instruction before responding. By fine-tuning on a targeted set of adversarial instructions, Intent-FT enables LLMs to generalize intent deduction to unseen attacks, thereby substantially improving their robustness. We comprehensively evaluate both parametric and non-parametric attacks across open-source and proprietary models, considering harmfulness from attacks, utility, over-refusal, and impact against white-box threats. Empirically, Intent-FT consistently mitigates all evaluated attack categories, with no single attack exceeding a 50\% success rate -- whereas existing defenses remain only partially effective. Importantly, our method preserves the model's general capabilities and reduces excessive refusals on benign instructions containing superficially harmful keywords. Furthermore, models trained with Intent-FT accurately identify hidden harmful intent in adversarial attacks, and these learned intentions can be effectively transferred to enhance vanilla model defenses. We publicly release our code at https://github.com/wj210/Intent_Jailbreak.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes