CR, AI · Feb 14, 2025

Fast Proxies for LLM Robustness Evaluation

Tim Beyer, Jan Schuchardt, Leo Schwinn, Stephan Günnemann

arXiv:2502.10487v1 · 10.44 citations · h-index: 12

Originality: Incremental advance

AI Analysis

The paper provides a more efficient method for evaluating LLM robustness, which is crucial for safe deployment. The advance is incremental, since it builds on existing red-teaming approaches.

The paper tackles the problem of expensive LLM robustness evaluation by showing that fast proxy metrics, such as direct prompting and embedding-space attacks, can predict real-world robustness with high correlation (up to r_s=0.94) while reducing computational cost by three orders of magnitude.

Evaluating the robustness of LLMs to adversarial attacks is crucial for safe deployment, yet current red-teaming methods are often prohibitively expensive. We compare the ability of fast proxy metrics to predict the real-world robustness of an LLM against a simulated attacker ensemble. This allows us to estimate a model's robustness to computationally expensive attacks without requiring runs of the attacks themselves. Specifically, we consider gradient-descent-based embedding-space attacks, prefilling attacks, and direct prompting. Even though direct prompting in particular does not achieve a high attack success rate (ASR), we find that it and embedding-space attacks can predict the ASR of the full attack ensemble well, achieving $r_p=0.87$ (linear) and $r_s=0.94$ (Spearman rank) correlations with the ensemble while reducing computational cost by three orders of magnitude.
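
The two correlation figures quoted in the abstract measure how well a cheap proxy's per-model ASR tracks the ASR of the expensive attack ensemble. The following is a minimal sketch of how such coefficients could be computed, not the authors' evaluation code. The function names and the per-model ASR values are hypothetical and chosen purely for illustration; they are not results from the paper.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson (linear) correlation coefficient r_p of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation r_s: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical ASR per model (illustrative values only, NOT from the paper).
proxy_asr = [0.02, 0.05, 0.10, 0.04, 0.20]      # cheap proxy, e.g. direct prompting
ensemble_asr = [0.30, 0.45, 0.70, 0.50, 0.95]   # expensive attack ensemble

r_p = pearson(proxy_asr, ensemble_asr)
r_s = spearman(proxy_asr, ensemble_asr)
print(f"r_p = {r_p:.2f}, r_s = {r_s:.2f}")
```

The sketch also shows why a proxy with low absolute ASR can still be useful: both coefficients depend only on how the proxy's values co-vary with the ensemble's across models, not on their magnitude.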
