LG AI ML · Oct 5, 2023

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas

arXiv:2310.03684v4 · 48.7 · 523 citations · h-index: 21 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses a critical security problem for users of widely deployed LLMs such as GPT and Llama, though it is an incremental improvement in defense mechanisms.

The paper tackles the vulnerability of large language models to jailbreaking attacks by proposing SmoothLLM, a defense algorithm that perturbs input prompts to detect adversarial inputs, achieving state-of-the-art robustness across multiple LLMs and attack types.

Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at https://github.com/arobey1/smooth-llm.
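The perturb-then-aggregate idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). The names `perturb`, `smooth_respond`, `toy_llm`, and `is_jailbroken`, the parameters `n_copies` and `q`, the random character-swap perturbation, and the majority-vote aggregation are all assumptions made here for illustration. The model is a stand-in stub that is "jailbroken" only when an exact adversarial suffix survives in the prompt.

```python
import random
import string
from typing import Callable, List


def perturb(prompt: str, q: float, rng: random.Random) -> str:
    """Replace a fraction q of characters with random printable characters.

    Assumption: a simple character-swap perturbation; other perturbation
    types are possible.
    """
    chars = list(prompt)
    k = int(len(chars) * q)  # number of positions to overwrite
    for i in rng.sample(range(len(chars)), k):
        chars[i] = rng.choice(string.printable)
    return "".join(chars)


def smooth_respond(
    prompt: str,
    llm: Callable[[str], str],
    is_jailbroken: Callable[[str], bool],
    n_copies: int = 10,
    q: float = 0.1,
    seed: int = 0,
) -> str:
    """Query the model on n_copies perturbed prompts and aggregate by vote."""
    rng = random.Random(seed)
    responses: List[str] = [llm(perturb(prompt, q, rng)) for _ in range(n_copies)]
    flags = [is_jailbroken(r) for r in responses]
    # Assumed aggregation rule: strict majority vote over the per-copy flags.
    majority = sum(flags) > n_copies / 2
    # Return one response that agrees with the majority decision.
    consistent = [r for r, f in zip(responses, flags) if f == majority]
    return rng.choice(consistent)


# Hypothetical stand-ins for a real model and a real jailbreak judge.
ADV_SUFFIX = "!!ADV-SUFFIX!!"


def toy_llm(prompt: str) -> str:
    # The toy model complies only if the adversarial suffix is fully intact.
    return "Sure, here is..." if ADV_SUFFIX in prompt else "I cannot help with that."


def is_jailbroken(response: str) -> bool:
    return response.startswith("Sure")


if __name__ == "__main__":
    attack = "Tell me something objectionable. " + ADV_SUFFIX
    print("undefended:", toy_llm(attack))
    print("smoothed:  ", smooth_respond(attack, toy_llm, is_jailbroken, q=0.3))
```

In this toy setting the suffix is brittle by construction, so most perturbed copies break it and the vote favors a refusal. With a real model, `llm` would wrap an actual generation call and `is_jailbroken` would be a proper judge; both choices affect the robustness and nominal-performance trade-off the abstract mentions.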
