CL | Nov 11, 2025

A methodological analysis of prompt perturbations and their effect on attack success rates

arXiv:2511.10686v1 | h-index: 4 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses LLM security evaluation for researchers and practitioners; it is an incremental advance because it builds on existing attack analysis methods.

The study investigated how different alignment methods (SFT, DPO, RLHF) affect LLMs' susceptibility to prompt attacks, finding that small prompt modifications significantly change Attack Success Rates (ASR) and that existing benchmarks may not fully reveal vulnerabilities.

This work investigates how different Large Language Model (LLM) alignment methods affect models' responses to prompt attacks. We selected open-source models based on the most common alignment methods, namely Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF). We conducted a systematic statistical analysis of how sensitive the Attack Success Rate (ASR) is to variations in prompts designed to elicit inappropriate content from LLMs. Our results show that, according to the statistical tests we ran, even small prompt modifications can significantly change the ASR, making the models more or less susceptible to particular types of attack. Critically, our results demonstrate that running existing 'attack benchmarks' alone may not be sufficient to expose all possible vulnerabilities of both models and alignment methods. This paper thus contributes to ongoing efforts on model attack evaluation through systematic, statistically based analyses of the different alignment methods and of how sensitive their ASR is to prompt variation.
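The abstract does not name the statistical tests used, so the following is only a minimal sketch of one way to check whether a prompt variation changes the ASR. It assumes a pooled two-proportion z-test; the function names and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
import math


def attack_success_rate(outcomes):
    """Return the fraction of attack attempts judged successful.

    `outcomes` is a non-empty sequence of booleans, one per attempt.
    """
    return sum(outcomes) / len(outcomes)


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the equality of two proportions.

    Assumption: this test stands in for the paper's unnamed tests.
    Returns (z, p_value) using the pooled standard error.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups are all successes or all failures: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for illustration only (not results from the paper):
# 30 of 200 baseline prompts succeed versus 60 of 200 perturbed prompts.
z, p_value = two_proportion_z_test(30, 200, 60, 200)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

With small samples or success counts near zero, an exact test such as Fisher's exact test would be a safer choice than this normal approximation.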

