LG | AI | Nov 24, 2025

Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM

arXiv:2511.18721v2 | 1 citation
Originality: Incremental advance
AI Analysis

This work addresses the challenge of securing LLMs against jailbreaking attacks by giving practitioners more realistic and actionable safety guarantees. The advance is incremental, since it builds directly on SmoothLLM.

The paper tackles the unrealistic 'k-unstable' assumption in SmoothLLM's jailbreaking defense by introducing a probabilistic '(k, ε)-unstable' framework. From it, the authors derive a data-informed lower bound on the defense probability that yields more trustworthy safety certificates against diverse attacks such as GCG and PAIR.

The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict "k-unstable" assumption that rarely holds in practice. This strong assumption can limit the trustworthiness of the provided safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, "(k, $\varepsilon$)-unstable," to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM's defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. By introducing the notion of (k, $\varepsilon$)-unstable, our framework provides practitioners with actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical and theoretically grounded mechanism to make LLMs more resistant to the exploitation of their safety alignments, a critical challenge in secure AI deployment.
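
To make the idea of a probabilistic certificate concrete, here is a minimal Python sketch of how such a lower bound could be computed. It is an illustration under stated assumptions, not the paper's derivation. It assumes that (k, ε)-unstable means an attack whose adversarial suffix has at least k characters perturbed still succeeds with probability at most ε; that each perturbed copy changes a fixed number of positions chosen uniformly without replacement; that copies behave independently; and that the defense succeeds when a strict majority of copies are not jailbroken. All function and parameter names are hypothetical.

```python
from math import comb


def prob_at_least_k_hits(prompt_len, suffix_len, num_perturbed, k):
    """Hypergeometric tail: probability that at least k of the perturbed
    positions land inside the adversarial suffix."""
    total = comb(prompt_len, num_perturbed)
    hits = sum(
        comb(suffix_len, i) * comb(prompt_len - suffix_len, num_perturbed - i)
        for i in range(k, min(num_perturbed, suffix_len) + 1)
    )
    return hits / total


def binomial_tail(n, p, t_min):
    """P[X >= t_min] for X ~ Binomial(n, p)."""
    return sum(comb(n, t) * p**t * (1 - p) ** (n - t) for t in range(t_min, n + 1))


def defense_lower_bound(prompt_len, suffix_len, num_perturbed, n_copies, k, eps):
    """Illustrative lower bound on the probability that the majority vote
    over n_copies perturbed prompts is not jailbroken."""
    # Assumption: a copy with >= k suffix hits defeats the attack with
    # probability at least 1 - eps. Copies with fewer hits are counted
    # pessimistically as jailbroken.
    hit = prob_at_least_k_hits(prompt_len, suffix_len, num_perturbed, k)
    alpha = (1 - eps) * hit
    # Assumption: strict majority of copies must resist the attack.
    majority = n_copies // 2 + 1
    return binomial_tail(n_copies, alpha, majority)
```

Setting `eps=0` reduces this sketch to the deterministic k-unstable case, in which enough suffix hits always defeat the attack. Because the binomial tail is increasing in the per-copy probability, a larger ε can only loosen the bound, which is the trade-off a practitioner would weigh when choosing a certification threshold.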

