CR AI Feb 14, 2025

Fast Proxies for LLM Robustness Evaluation

arXiv:2502.10487v1, 4 citations, h-index: 12
Originality: Incremental advance
AI Analysis

The paper offers a cheaper way to evaluate LLM robustness, which is crucial for safe deployment. The advance is incremental, since it builds on existing red-teaming approaches.

The paper tackles the problem of expensive LLM robustness evaluation by showing that fast proxy metrics, such as direct prompting and embedding-space attacks, can predict real-world robustness with high correlation (up to r_s=0.94) while reducing computational cost by three orders of magnitude.

Evaluating the robustness of LLMs to adversarial attacks is crucial for safe deployment, yet current red-teaming methods are often prohibitively expensive. We compare the ability of fast proxy metrics to predict the real-world robustness of an LLM against a simulated attacker ensemble. This allows us to estimate a model's robustness to computationally expensive attacks without requiring runs of the attacks themselves. Specifically, we consider gradient-descent-based embedding-space attacks, prefilling attacks, and direct prompting. Even though direct prompting in particular does not achieve a high attack success rate (ASR), we find that it and embedding-space attacks can predict attack success rates well, achieving $r_p=0.87$ (linear) and $r_s=0.94$ (Spearman rank) correlations with the full attack ensemble while reducing computational cost by three orders of magnitude.
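
As a rough illustration of the two correlation measures quoted in the abstract, the sketch below computes a linear (Pearson) and a rank (Spearman) correlation between a cheap proxy's per-model ASR and an expensive ensemble's per-model ASR. It is not the paper's evaluation code: the function names and the five ASR values are hypothetical, made up only to show the mechanics.

```python
from math import sqrt


def pearson(x, y):
    """Linear (Pearson) correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model attack success rates (illustrative values only).
proxy_asr = [0.02, 0.05, 0.10, 0.20, 0.40]      # e.g. direct prompting
ensemble_asr = [0.30, 0.45, 0.60, 0.80, 0.95]   # e.g. full attack ensemble

r_p = pearson(proxy_asr, ensemble_asr)
r_s = spearman(proxy_asr, ensemble_asr)
print(f"r_p = {r_p:.2f}, r_s = {r_s:.2f}")
```

In this toy data the proxy's ASR is far lower than the ensemble's, yet it orders the models identically, so the rank correlation is 1 while the linear correlation is below 1. That is the sense in which a proxy with low ASR can still be a useful predictor of robustness rankings.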

