CRAI | Jan 27, 2025

Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs

arXiv:2501.16534v2 | 2 citations | h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses vulnerabilities in aligned LLMs used in safety-critical applications, offering an efficient method to model, and potentially mitigate, jailbreaking attacks. It is an incremental advance because it builds on existing jailbreak techniques.

The paper tackles jailbreak attacks on aligned large language models (LLMs) by extracting surrogate safety classifiers from subsets of the model. The best candidates reach an F1 score above 80% using as little as 20% of the architecture, and attacks crafted on a surrogate built from 50% of Llama 2 transfer to the full model with a 70% attack success rate (ASR), compared with 22% when attacking the full LLM directly.

Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we introduce and evaluate a new technique for jailbreak attacks. We observe that alignment embeds a safety classifier in the LLM responsible for deciding between refusal and compliance, and seek to extract an approximation of this classifier: a surrogate classifier. To this end, we build candidate classifiers from subsets of the LLM. We first evaluate the degree to which candidate classifiers approximate the LLM's safety classifier in benign and adversarial settings. Then, we attack the candidates and measure how well the resulting adversarial inputs transfer to the LLM. Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find that attacks mounted on the surrogate classifiers can be transferred to the LLM with high success. For example, a surrogate using only 50% of the Llama 2 model achieved an attack success rate (ASR) of 70% with half the memory footprint and runtime -- a substantial improvement over attacking the LLM directly, where we only observed a 22% ASR. These results show that extracting surrogate classifiers is an effective and efficient means for modeling (and therein addressing) the vulnerability of aligned models to jailbreaking attacks.
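The two headline metrics in the abstract, agreement F1 and transfer ASR, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `f1_agreement` and `attack_success_rate` are hypothetical, and the sketch assumes that each prompt has already been reduced to a boolean (refused or not, complied or not) and that "refusal" is treated as the positive class.

```python
from typing import Sequence


def f1_agreement(llm_refuses: Sequence[bool],
                 surrogate_refuses: Sequence[bool]) -> float:
    """F1 of the surrogate's refuse/comply predictions.

    Assumption: the full LLM's decisions are used as ground truth and
    "refuse" is the positive class.
    """
    if len(llm_refuses) != len(surrogate_refuses):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(llm_refuses, surrogate_refuses))
    tp = sum(1 for y, p in pairs if y and p)        # both refuse
    fp = sum(1 for y, p in pairs if not y and p)    # only the surrogate refuses
    fn = sum(1 for y, p in pairs if y and not p)    # only the LLM refuses
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def attack_success_rate(llm_complied: Sequence[bool]) -> float:
    """Fraction of adversarial inputs, crafted against the surrogate,
    that the full LLM complied with (i.e. the attack transferred)."""
    if not llm_complied:
        return 0.0
    return sum(1 for c in llm_complied if c) / len(llm_complied)


if __name__ == "__main__":
    # Toy labels, invented purely for illustration.
    llm = [True, True, False, False, True]
    surrogate = [True, False, False, True, True]
    print(f"agreement F1: {f1_agreement(llm, surrogate):.3f}")

    transferred = [True] * 7 + [False] * 3
    print(f"transfer ASR: {attack_success_rate(transferred):.2f}")
```

Under these assumptions, an F1 above 0.8 would mean the surrogate's refusal decisions closely track the LLM's, and the ASR measures how often inputs optimized against the surrogate also succeed against the full model.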

