CR AI CL Dec 30, 2025

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

arXiv:2512.24044v1
2 citations
h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses the safety risks of LLM deployment by assessing jailbreak attacks in realistic deployment settings. It is incremental: it extends existing evaluations by adding safety filters to the pipeline under test.

The study systematically evaluated jailbreak attacks on large language models (LLMs) across the full inference pipeline, including input and output safety filters. It found that nearly all attacks can be detected by at least one filter, which suggests that prior assessments overestimated the attacks' practical success. It also identified room to better balance the filters' recall and precision.

As large language models (LLMs) are increasingly deployed, ensuring their safe use is paramount. Jailbreaks, adversarial prompts that bypass model alignment to trigger harmful outputs, present significant risks, with existing studies reporting high success rates against common LLMs. However, previous evaluations have focused solely on the models, neglecting the full deployment pipeline, which typically incorporates additional safety mechanisms like content moderation filters. To address this gap, we present the first systematic evaluation of jailbreak attacks targeting LLM safety alignment, assessing their success across the full inference pipeline, including both input and output filtering stages. Our findings yield two key insights: first, nearly all evaluated jailbreak techniques can be detected by at least one safety filter, suggesting that prior assessments may have overestimated the practical success of these attacks; second, while safety filters are effective in detection, there remains room to better balance recall and precision to further optimize protection and user experience. We highlight critical gaps and call for further refinement of detection accuracy and usability in LLM safety systems.
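
The difference between model-only and full-pipeline success, and the recall/precision trade-off mentioned in the abstract, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the callable-based filter interface, and the toy data are hypothetical and are not taken from the paper's evaluation code.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a filter returns True when it flags the text.
Filter = Callable[[str], bool]


def evaluate_attack(
    prompts: Sequence[str],
    model: Callable[[str], str],
    judge: Callable[[str], bool],
    input_filter: Filter,
    output_filter: Filter,
) -> Tuple[float, float]:
    """Return (model-only success rate, end-to-end success rate).

    Model-only success: the judge deems the response harmful.
    End-to-end success: the response is harmful AND neither the
    input filter nor the output filter flagged the exchange.
    """
    if not prompts:
        raise ValueError("prompts must be non-empty")
    model_only = 0
    end_to_end = 0
    for prompt in prompts:
        response = model(prompt)
        harmful = judge(response)
        blocked = input_filter(prompt) or output_filter(response)
        model_only += int(harmful)
        end_to_end += int(harmful and not blocked)
    n = len(prompts)
    return model_only / n, end_to_end / n


def precision_recall(flags: List[bool], labels: List[bool]) -> Tuple[float, float]:
    """Precision and recall of a filter, given its flags and ground-truth labels."""
    tp = sum(f and l for f, l in zip(flags, labels))
    fp = sum(f and not l for f, l in zip(flags, labels))
    fn = sum((not f) and l for f, l in zip(flags, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy stand-ins (purely illustrative, not real models or filters).
def toy_model(prompt: str) -> str:
    return "refusal" if prompt == "attack-d" else "harmful:" + prompt


def toy_judge(response: str) -> bool:
    return response.startswith("harmful:")


def toy_input_filter(prompt: str) -> bool:
    return prompt == "attack-a"


def toy_output_filter(response: str) -> bool:
    return response.endswith("attack-b")


attacks = ["attack-a", "attack-b", "attack-c", "attack-d"]
rates = evaluate_attack(attacks, toy_model, toy_judge, toy_input_filter, toy_output_filter)
```

In this toy setup, three of the four attacks elicit a harmful response from the model, but only one also passes both filters. That gap is the kind of overestimate the abstract attributes to model-only evaluations. A filter that flags more aggressively would raise recall at the cost of precision, which is where benign prompts start being blocked.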

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
