Towards the Worst-case Robustness of Large Language Models
This paper addresses adversarial attacks on large language models, a problem critical for AI safety, but the work is incremental because it builds on existing robustness frameworks.
The paper tackles the vulnerability of large language models to adversarial attacks by studying worst-case robustness. It finds that most deterministic defenses have nearly 0% worst-case robustness, and it proposes a method to certify robustness, for example against any attack with an average ℓ0 perturbation of 2.02.
Recent studies have revealed the vulnerability of large language models to adversarial attacks, where adversaries craft specific input sequences to induce harmful, violent, private, or incorrect outputs. In this work, we study their worst-case robustness, i.e., whether an adversarial example exists that leads to such undesirable outputs. We upper bound the worst-case robustness using stronger white-box attacks, indicating that most current deterministic defenses achieve nearly 0\% worst-case robustness. We propose a general tight lower bound for randomized smoothing using fractional knapsack solvers or 0-1 knapsack solvers, and use these solvers to bound the worst-case robustness of all stochastic defenses. Based on these solvers, we provide theoretical lower bounds for several previous empirical defenses. For example, we certify the robustness of a specific case, smoothing using a uniform kernel, against \textit{any possible attack} with an average $\ell_0$ perturbation of 2.02 or an average suffix length of 6.41.
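For readers unfamiliar with the fractional knapsack solvers mentioned above, the following is a minimal sketch of the textbook greedy algorithm, not the certification procedure itself. How the robustness bound is encoded as knapsack items is not stated in this abstract, so the function name `fractional_knapsack`, the `(value, weight)` item layout, and the example numbers are all illustrative assumptions.

```python
from typing import List, Tuple


def fractional_knapsack(items: List[Tuple[float, float]], capacity: float) -> float:
    """Return the maximum total value that fits in `capacity`.

    Hypothetical helper for illustration only. Each item is a
    (value, weight) pair with weight > 0, and items may be taken
    fractionally, which is what makes the greedy choice optimal.
    """
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    if any(weight <= 0 for _, weight in items):
        raise ValueError("every weight must be positive")

    total = 0.0
    remaining = capacity
    # Greedy rule: take items in decreasing order of value per unit weight.
    for value, weight in sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True):
        if remaining <= 0:
            break
        take = min(weight, remaining)    # whole item, or the fraction that fits
        total += value * take / weight   # value scales linearly with the fraction
        remaining -= take
    return total


if __name__ == "__main__":
    # Illustrative numbers, not taken from the paper.
    example_items = [(60.0, 10.0), (100.0, 20.0), (120.0, 30.0)]
    print(fractional_knapsack(example_items, capacity=50.0))
```

The fractional variant is solved exactly by this sort-and-fill pass, whereas the 0-1 variant forbids splitting items and generally needs dynamic programming or search.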