LG · AI · CL · CR · Jan 14, 2025

Gandalf the Red: Adaptive Security for LLMs

arXiv:2501.07927v3 · 7 citations · h-index: 10 · ICML
Originality: Incremental advance
AI Analysis

This work addresses security and usability challenges for LLM application developers, offering incremental improvements through a new evaluation framework and dataset.

The paper addresses how defenses against prompt attacks in LLM applications are evaluated, focusing on two overlooked factors: dynamic adversarial behavior and the usability penalties that defenses impose. It introduces Gandalf, a crowd-sourced platform that generated a dataset of 279k prompt attacks, and identifies strategies for balancing security and utility.

Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose D-SEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and expresses the security-utility trade-off in an optimizable form. We further address the shortcomings in existing evaluations by introducing Gandalf, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attacks. Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade usability even without blocking requests. We demonstrate that restricted application domains, defense-in-depth, and adaptive defenses are effective strategies for building secure and useful LLM applications.
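
To make the security-utility trade-off described in the abstract concrete, here is a minimal Python sketch. It is not the paper's D-SEC formulation: the `DefenseLog` structure, the weighted-sum objective, the weight `lam`, and all numbers are assumptions made up for illustration. It only shows the general idea of scoring attackers and legitimate users separately, treating an attacker session as multi-step, and combining both scores into a single quantity that can be optimized over candidate defenses.

```python
from dataclasses import dataclass


@dataclass
class DefenseLog:
    """Hypothetical evaluation log for one candidate defense (illustrative only)."""
    name: str
    # One inner list per attacker session; each entry records whether
    # that attempt in the session got past the defense.
    attacker_sessions: list[list[bool]]
    # One score per benign request in [0, 1]; 0.0 means the request was blocked,
    # values below 1.0 mean the answer was degraded without being blocked.
    benign_utility: list[float]


def attacker_failure_rate(log: DefenseLog) -> float:
    # A session counts as broken if ANY attempt in it succeeds (multi-step attacker).
    broken = sum(any(session) for session in log.attacker_sessions)
    return 1.0 - broken / len(log.attacker_sessions)


def mean_utility(log: DefenseLog) -> float:
    return sum(log.benign_utility) / len(log.benign_utility)


def objective(log: DefenseLog, lam: float = 0.5) -> float:
    # Assumed form: weighted sum of security and utility; lam is a made-up knob.
    return lam * attacker_failure_rate(log) + (1.0 - lam) * mean_utility(log)


# Made-up numbers, not taken from the Gandalf dataset.
logs = [
    DefenseLog(
        "no_defense",
        [[True], [False, True], [True], [True]],
        [1.0, 1.0, 1.0, 1.0],
    ),
    DefenseLog(
        "strict_system_prompt",
        [[False, False], [False, False, False], [False, True], [False]],
        [0.5, 0.5, 1.0, 0.0],
    ),
]

best = max(logs, key=objective)
print(best.name, objective(best))
```

In this toy setup the stricter defense wins when security and utility are weighted equally, but setting `lam=0.0` (utility only) flips the choice, which is the kind of trade-off the abstract refers to.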

Code Implementations: 1 repo
